ABSTRACT


This report presents the results of a one-year study to investigate and develop 
techniques for efficiently representing scanned electronic documents. Major results 
include the definition and preliminary performance results of a Universal System for Effi- 
cient Electronic Mail (USEEM), offering a potential order of magnitude improvement over 
standard facsimile techniques for representing textual material. 
THE DEVELOPMENT OF EFFICIENT CODING 
FOR AN ELECTRONIC MAIL SYSTEM 

I. INTRODUCTION 


PROGRAM OVERVIEW 

This paper presents the results of research by the Information Processing 
Research Group of the Jet Propulsion Laboratory for the U.S. Postal Service (USPS) 
from April 1980 to April 1981. This research was directed towards the development of 
efficient preprocessing and coding algorithms for binary facsimile imagery. 

Objectives 

The program objectives at inception were to provide algorithm specifications, 
characteristics, and functional implementations which would lead to the realization of a 
prototype model, by others, for test and evaluation. Such a model would later be incor- 
porated within a planned USPS Electronic Message Service System. The stated 
algorithm performance goal was to achieve a 2:1 improvement over the CCITT
(International Consultative Committee for Telephone and Telegraph) standard technique for 
one-dimensional facsimile coding. 

Redirection. A December 1980 interim presentation of the results of algorithm 
investigations introduced a new concept called a "Universal System for Efficient Elec- 
tronic Mail" (USEEM). These initial results promised such overwhelming potential for 
improved performance that the USPS redirected JPL to concentrate its remaining efforts 
on defining USEEM further. Functional implementation diagrams, which would allow a 
prototype system to be built, would instead be completed during anticipated follow-on 
programs. 


SUMMARY OF MAJOR RESULTS

Noiseless Coding 

A constraint that binary facsimile images must be reconstructed precisely, bit for bit
(i.e., reversible coding, noiseless coding), will limit achievable compression factors
to the following (for 200 lines/inch; compression factors are higher for 300 lines/inch):

• Less than 10:1 on text. (1)

• An average of 20:1 over all documents. (2)

Further, the stated goal of a 2:1 improvement in performance over the CCITT
one-dimensional standard is not generally possible on textual material. While JPL noiseless
coding techniques combined with the CCITT standard one-dimensional run-length
preprocessing did improve performance, major gains, as in (1) and (2), can only be achieved
by utilizing two-dimensional preprocessing techniques. Subtle variations in performance
exist among the best techniques, but they are minor. All of them fit the summary
conclusions in (1) and (2).

USEEM 

A concept for a Universal System for Efficient Electronic Mail (USEEM) was
conceived which can outperform the noiseless coding function as above, but by: (3)


• removing the restrictive bit-for-bit noiseless constraint, 

• utilizing unsupervised-character-recognition, and 

• utilizing adaptive noiseless coding of "text". 

This concept offers the potential for: 

• 100:1 compression on dense text (higher on lighter documents), (4) 

• improved reproduced quality, (5) 

• compatibility with direct entry electronic mail with (6) 

• automated translation to word processor form, and 

• efficient coding of direct entry text. 

In addition, the further development and subsequent implementation of 
USEEM could be scheduled in stages corresponding to distinct modular increments 
to performance. (7) 

Observe that the major result of (4) in essence means that USEEM offers the
potential for an order of magnitude reduction in the transmission and buffering requirements
for a predominant component of digitized electronic mail.†

Cost Impact [1]

It has been estimated that a future facsimile-based USPS electronic mail system 
would handle between 3 and 25 billion messages per year. Estimates of high-speed 
storage and communication channel annual costs for standard facsimile compression 
techniques and USEEM are compared below in Table 1 and Table 2 assuming 1978 
dollars. 


Table 1. Annual Cost Comparison, 3 Billion Messages/Year.

                                      Annual Costs ($ Million 1978)
                                      High-Speed   Communication
                                      Storage      Channel         Total
Standard 2-Dimensional Compression       1.3           7.0           8.3
USEEM                                    0.4           1.2           1.6
Cost Reduction                           0.9           5.8           6.7


†This is roughly a factor of twenty improvement over the CCITT one-dimensional 
standard. 


Table 2. Annual Cost Comparison, 25 Billion Messages/Year.

                                      Annual Costs ($ Million 1978)
                                      High-Speed   Communication
                                      Storage      Channel         Total
Standard 2-Dimensional Compression       6.1          37.0          43.1
USEEM                                    1.9           6.0           7.9
Cost Reduction                           4.2          31.0          35.2


Note that in 1990 dollars (a time frame closer to a full USEEM implementation) the 
savings at 25 billion messages per year would be over $1 billion in a ten-year period. 

II. NOISELESS FACSIMILE COMPRESSION 


This section investigates the relative and absolute performance of various alterna- 
tive algorithms designed to provide efficient noiseless representations of the binary im- 
ages which are characteristic of electronic mail systems. The results of extensive simu- 
lations using both familiar, well documented techniques as well as new approaches are 
preceded by some necessary background development. A summary of major observa- 
tions is given at the end of this section. 

BACKGROUND 

A data sequence can be said to be coded noiselessly when the original data
sequence can be recovered, without error, from the coded one. Noiseless coding to
achieve "data compression" can be applied to advantage on virtually any data source
which exhibits statistical characteristics having memory and/or non-uniform symbol
probability distributions. For example, grey scale images have memory because
adjacent picture elements (pixels) tend to be similar. Consequently, a sequence of
differences between adjacent pixels will be distributed about zero in a unimodal fashion. Use
of a variable length code allows shorter codewords to be assigned to the smaller
difference values which occur more often. On the average the number of bits required may be
much less than if a fixed length representation had been used. These same two steps of
preprocessing to make use of source memory (differences here) and assigning
codewords apply to numerous problems.

Application of noiseless coding to images which only have two brightness values is 
known as predictive facsimile data compression. 


JPL Noiseless Coding Software 

Other JPL work resulted in general purpose adaptive variable length coding
techniques which were expected to be applicable to the facsimile problem when combined
with appropriate preprocessing algorithms [2]-[3]. To facilitate investigations of this
possibility, JPL placed these variable length coding techniques in an easily used Fortran
software package. A description is provided in Ref. 4 and Appendix A.

Measures of Performance 




The "entropy" of a probability distribution with values $p_0, p_1, p_2, \ldots, p_q$ is given by

$$H = -\sum_{i=0}^{q} p_i \log_2 p_i \qquad (8)$$



and represents a bound on the average performance of any noiseless coder (that is, a
lower bound on the average number of bits per symbol) in representing a long sequence
of symbols which are generated independently from some unchanging data source with
these symbol probability values. If the conditions of independence and stationarity
(unchanging statistics) are met, then H in (8) represents the best that any noiseless
coding operation can do. However, when these conditions are not met, an average rate
below H may be possible. In this case, H is a practical guide to good performance of a
coder which does not utilize knowledge of this source memory or non-stationarity to its
advantage. Hence, using H as an absolute bound to performance should be viewed with
caution.

Example. As an example, a binary image whose right half is white and left half black
has the same average symbol probability distribution ($p_0 = 1/2$, $p_1 = 1/2$) as an image of
randomly occurring black and white spots. An entropy calculation based on this
distribution would yield an entropy of 1 bit/sample. However, the first source could be coded
with nearly zero bits in each half by recognizing the memory and/or non-stationarity in
this source (the right half has $p_0 = 1$, $p_1 = 0$, $H = 0$; and the left half has $p_0 = 0$,
$p_1 = 1$, $H = 0$).
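
The example above can be checked numerically. The following is a minimal Python sketch, not part of the original study: the helper name `entropy_bits` and the 16-pixel scan line are illustrative assumptions. It evaluates (8) on the pooled statistics of one scan line of the half-black, half-white image and then on each half separately.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Entropy (8), in bits/sample, of the observed symbol distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    # p * log2(1/p) summed over the symbols that actually occur
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# One scan line of the example image (1 = black, 0 = white); the length is arbitrary.
line = [1] * 8 + [0] * 8

whole = entropy_bits(line)       # pooled statistics: p0 = p1 = 1/2
left = entropy_bits(line[:8])    # black half only
right = entropy_bits(line[8:])   # white half only
print(whole, left, right)
```

The pooled figure says nothing about the order of the pixels, which is why a coder that recognizes the two regions can do far better than H suggests.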


Entropies in this task. In the noiseless coding of binary images containing text and 
graphics we will use entropy measures for average symbol distributions which emanate 
from preprocessors which have already made use of source memory. The entropies then 
represent good guides to performance assuming that particular preprocessor is used. 

For example, run-length preprocessing is a widely used and effective technique for
facsimile data compression [5]-[6]. One-dimensional run-length entropies will be used
here as a guide to facsimile compression ratios for algorithms which do not make use of
information from adjacent lines. We will also quote entropy values associated with the
output of two-dimensional run-length preprocessors of binary data. Finally, to deal with
efficient representations of textual materials we will use entropies based on
distributions of characters, etc. In each case the entropies are guides to good performance if the
preprocessing is followed by appropriate variable length coding. This general subject is
more thoroughly treated in Refs. 2-4 and Appendix A.

Test Sets 


The test sets used to investigate performance included: 

a) Four pictures (Fig. 1) from the U.S. Postal Service, digitized to both 200 and 
300 points/inch (p/inch); 

b) the standard eight document CCITT test set (Figs. 2 and 3) digitized to 200 
p/inch [7]; 

c) several pages of typical word processor output, digitized to 200 p/inch by 
the Naval Ocean Systems Center. 

INVESTIGATIONS OF DOCUMENTED ALGORITHMS 

CCITT 1-D Standard [7]

In this standardized approach run-lengths are generated for sequences of black and 
white runs along individual lines, as illustrated in Fig. 4. 

The black runs are coded with a separate standardized variable length code which
has been optimized with a goal of efficiently representing black runs at an entropy of
$H_b$ bits/run. The white runs are also coded with a separately optimized code with a goal of
$H_w$ bits/run. The code words are concatenated (* in Fig. 4) to produce an overall output
sequence.

The "one-dimensional (1-D) run-length entropy" for the binary source itself is given 
by 

$$H_{1D} = \frac{\text{Average Bits/Run}}{\text{Average Input Pixels/Run}} = \frac{H_w + H_b}{r_w + r_b} \ \text{bits/pixel} \qquad (9)$$

where $r_w$ and $r_b$ are the average run-lengths for white and black runs respectively. 
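
As an illustration of (9), here is a minimal Python sketch. It is an assumption-laden example rather than the software used in the study: the function names and the 16-pixel sample line are hypothetical, scan lines are given as strings of "0" (white) and "1" (black), and the CCITT convention that every line starts with a white run is ignored.

```python
import math
from collections import Counter
from itertools import groupby

def entropy_bits(values):
    """Entropy, in bits per value, of the observed distribution of values."""
    counts = Counter(values)
    total = len(values)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def run_length_entropy_1d(lines):
    """Evaluate (9): (H_w + H_b) / (r_w + r_b), in bits/pixel."""
    white, black = [], []
    for line in lines:
        for pixel, run in groupby(line):
            # Collect run lengths separately for white and black runs.
            (black if pixel == "1" else white).append(len(list(run)))
    h_w, h_b = entropy_bits(white), entropy_bits(black)
    r_w = sum(white) / len(white)   # average white run length
    r_b = sum(black) / len(black)   # average black run length
    return (h_w + h_b) / (r_w + r_b)

# Hypothetical scan line: white runs 4, 4, 4 and black runs 2, 1, 1.
h_1d = run_length_entropy_1d(["0000110000100001"])
print(round(h_1d, 4))
```

The reciprocal of the result is the corresponding compression factor relative to one bit per pixel.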

Bell Labs Markov Prediction [8]


The basic Markov prediction approach by Bell Labs is illustrated in Fig. 5. 

Coding is accomplished by 1-D run-length coding of the error sequence resulting 
from this process and some useful tricks in ordering the runs [8]. 

Mr. M. J. Hiller, Director
Office of Advanced Mail System Development
11711 Parklawn Avenue
Rockville, Maryland 20852

Gentlemen:

This is a sample of the letter we propose to use as a
"standard" for imaging experiments at NEIC, San Diego. It
was made on a Wang System 1222 Dual Cassette Typewriter
which consists of a modified IBM Selectric typewriter, two
cassette holders, and a magnetic core memory capable of
storing pages of data such as this letter. The cassette
tapes are being made to store the data for each character
in United States of America Standard Code for Information
Interchange (USASCII) format. This is a standard seven
bit binary code for each character which is widely used in
industry. In USASCII form this page as written can be
exactly defined by 15099 bits of data (excluding signature,
logo or header information). When scanned at 200 x 200
picture elements per inch with six bits per element for
grey scale the page is defined by 22,440,000 bits.

By recording the contents of this letter on cassette
tape, it is possible to reproduce a quantity of duplicate
originals, all nominally exactly the same. Since the
typewriter is an IBM Selectric it is also possible to
change the type font without changing the message. It is
also possible to change ribbons (a five- or ten-minute
process) to yield copies of differing colors. It is of
course possible to write on all textures, colors and
weights of paper with or without letter head. It will also
allow copies of this text to be analyzed both with and
without signatures of various colors.

This ability to provide complete parameter selection and
consistency control for analysis of thresholds, contrasts,
color separation, compressibility coefficients, and
character fonts will be of great benefit in quantifying the
requirements of U. S. Postal Service Scanner technology.


Frank Martin
NUC Code 3100
Problem N451


Fig. 1. Postal Service Test Set. 

Fig. 4. Standard CCITT 1-D Coding. 


LINE n-1:   a1  a2  a3
LINE n:     a0  x

PREDICT x BASED ON PREVIOUS 16 STATES DEFINED BY a0 a1 a2 a3,
CHOOSING x TO BE THE MOST LIKELY VALUE GIVEN a0 a1 a2 a3


Fig. 5. Basic Markov Predictor. 


The Bell Labs work did not actually code the runs but instead relied on entropies as
an estimate. Stated results in subsequent tables follow this limitation. However,
appropriate application of JPL variable length coding led to results within 10% of these
entropy values. Additional modifications to the prediction process made up the additional
10%.
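
The prediction step of Fig. 5 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Bell Labs or JPL implementation: the exact positions of a1, a2, a3 on line n-1, the treatment of off-page pixels as white, the two-pass training of the prediction table, and all function names are assumptions made for the example.

```python
def state(img, r, c):
    """Pack the four previously scanned neighbours of pixel (r, c) into 0..15."""
    def px(rr, cc):
        inside = 0 <= rr < len(img) and 0 <= cc < len(img[0])
        return img[rr][cc] if inside else 0    # assumption: off-page pixels are white
    a0 = px(r, c - 1)        # previous pixel on line n
    a1 = px(r - 1, c - 1)    # assumed neighbour positions on line n-1
    a2 = px(r - 1, c)
    a3 = px(r - 1, c + 1)
    return a0 | (a1 << 1) | (a2 << 2) | (a3 << 3)

def train(img):
    """For each of the 16 states, pick the most likely value of x."""
    counts = [[0, 0] for _ in range(16)]
    for r, row in enumerate(img):
        for c, x in enumerate(row):
            counts[state(img, r, c)][x] += 1
    return [1 if ones > zeros else 0 for zeros, ones in counts]

def encode(img, table):
    """Prediction-error image: 1 wherever the prediction was wrong."""
    return [[x ^ table[state(img, r, c)] for c, x in enumerate(row)]
            for r, row in enumerate(img)]

def decode(errors, table):
    """Rebuild the image in raster order; only already-decoded pixels are read."""
    img = [[0] * len(row) for row in errors]
    for r, row in enumerate(errors):
        for c, e in enumerate(row):
            img[r][c] = e ^ table[state(img, r, c)]
    return img

page = [[0, 0, 1, 1, 1, 0, 0] for _ in range(6)]   # a vertical black bar
table = train(page)
errors = encode(page, table)
print(sum(map(sum, errors)), "prediction errors for", sum(map(sum, page)), "black pixels")
```

Because the error image contains far fewer black pixels than the original, its runs are longer and cheaper to run-length code, and the decoder recovers the page exactly as long as it uses the same prediction table.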

IBM Dual Mode [9]


The basic idea of this approach is to make use of the observation that a run in one 
line will tend to be closely followed in the next. The error in this prediction can usually be 
variable length coded to advantage. When it can't, the system defaults to a standard, 
one-dimensional, run-length mode. 

Performance Runs 

Tables 3-6 show the results of applying these algorithms to the Postal Service test 
set of 4 images using both horizontal and vertical scan lines at 200 and 300 points/inch. 


ADDITIONAL APPROACHES 

Replacing the CCITT Standard Code 

A direct replacement of the CCITT one-dimensional, standard, variable length code 
(see Fig. 4) with an appropriate JPL algorithm (Appendix A) yielded average perfor- 
mance lying half-way between one-dimensional run-length entropies and the corres- 
ponding performance of the CCITT code. 

Histogram update. The input to a JPL variable length coder is the integers 0, 1, 2, ...,
which are the result of mapping the most likely run-lengths into the smaller integers. The
condition

$$p_0 \ge p_1 \ge p_2 \ge \cdots \qquad (10)$$

should be maintained to assure the best variable length coding performance. This can be
done adaptively by adjusting the run-length mapping sample-by-sample to reflect any
changes in run-length probabilities. This adjustment can be accomplished without any
additional data rate requirements since the necessary "histogram update" needs only
previous samples. The result of this configuration, illustrated in Fig. 6, was performance
uniformly within 2% of the 1-D entropy. (11)
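
The histogram update idea can be sketched as follows. This Python fragment is a minimal illustration, not the JPL Fortran package: the class name, the tie-breaking rule, and the small alphabet are assumptions. It maps each run-length to its current rank (0 for the most frequent value seen so far) and updates the counts only after the sample has been mapped, so a decoder holding the same history can invert the mapping with no side information.

```python
class HistogramUpdateMapper:
    """Maps symbols to ranks so that frequently seen symbols get small integers."""

    def __init__(self, alphabet_size):
        self.counts = [0] * alphabet_size
        self.order = list(range(alphabet_size))   # order[rank] = symbol

    def _update(self, symbol):
        self.counts[symbol] += 1
        # Most frequent first; ties broken by symbol value (an assumed rule).
        self.order.sort(key=lambda s: (-self.counts[s], s))

    def encode(self, symbol):
        rank = self.order.index(symbol)   # uses previous samples only
        self._update(symbol)
        return rank

    def decode(self, rank):
        symbol = self.order[rank]         # same table the encoder used
        self._update(symbol)
        return symbol

runs = [5, 5, 5, 2, 5, 2, 5]              # hypothetical run-lengths
encoder = HistogramUpdateMapper(alphabet_size=8)
ranks = [encoder.encode(r) for r in runs]

decoder = HistogramUpdateMapper(alphabet_size=8)
recovered = [decoder.decode(k) for k in ranks]
print(ranks, recovered)
```

After the first few samples the common run-length 5 is sent as rank 0, which is the situation condition (10) asks for; the ranks would then be passed to the variable length coder.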

Table 3. Baseline Performance: 200 Points/Inch, Vertical Scan.

Algorithm performance in bits/pixel (b/p). Column key:
  (a) 1-D run-length entropy
  (b) CCITT run-length coding
  (c) Entropy, Bell Labs algorithm: 1-D run-length coding of ordering with ref. to prev. line
  (d) Entropy, Bell Labs algorithm: 1-D run-length coding of 2-state ordered pred. err.
  (e) Entropy, Bell Labs algorithm: 1-D run-length coding of 16-state predicted errors
  (f) Entropy, Bell Labs algorithm: 1-D run-length coding of 16-state ordered pred. err.
  (g) IBM Dual Mode, K = 4
  (h) IBM Dual Mode, K = ∞

Image name         Size (lines x pixels)   (a)     (b)     (c)     (d)     (e)     (f)     (g)     (h)
WJM (typed)        1700 x 2200             0.091   0.114   0.078   0.072   0.085   0.044   0.081   0.053
GWU (typed)        1700 x 2200             0.115   0.139   0.080   0.081   0.075   0.076   0.095   0.073
BUNYAN (written)   1700 x 2200             0.049   0.089   0.040   0.053   0.054   0.037   0.054   0.038
FORM               1700 x 2200             0.197   0.238   0.101   0.093   0.097   0.0?3   0.138   0.094
Average b/p                                0.110   0.145   0.082   0.075   0.078   0.048   0.092   0.047
Average compression factor                 -       4.9     12.2    13.3    12.8    14.7    10.9    14.9



Table 4. Baseline Performance: 200 Points/Inch, Horizontal Scan.

Algorithm performance in bits/pixel (b/p). Column key:
  (a) 1-D run-length entropy
  (b) CCITT run-length coding
  (c) Entropy, Bell Labs algorithm: 1-D run-length coding of ordering with ref. to prev. line
  (d) Entropy, Bell Labs algorithm: 1-D run-length coding of 2-state ordered pred. err.
  (e) Entropy, Bell Labs algorithm: 1-D run-length coding of 16-state predicted errors
  (f) Entropy, Bell Labs algorithm: 1-D run-length coding of 16-state ordered pred. err.
  (g) IBM Dual Mode, K = 4
  (h) IBM Dual Mode, K = ∞

Image name         Size (lines x pixels)   (a)     (b)     (c)     (d)     (e)     (f)     (g)     (h)
WJM (typed)        2200 x 1700             ?       0.128   0.078   0.069   0.083   0.046   0.087   0.043
GWU (typed)        2200 x 1700             0.097   0.118   0.080   0.073   0.084   0.071   0.085   0.044
BUNYAN (written)   2200 x 1700             0.053   0.070   0.059   0.055   0.053   0.042   0.055   0.041
FORM               2200 x 1700             0.134   0.157   0.087   0.073   0.088   0.077   0.103   0.074
Average b/p                                0.097   0.118   0.074   0.047   0.077   0.044   0.082   0.041
Average compression factor                 -       8.5     13.2    14.9    13.0    15.4    12.2    14.4

Table 5. Baseline Performance: 300 Points/Inch, Vertical Scan.

Algorithm performance in bits/pixel (b/p). Column key:
  (a) 1-D run-length entropy
  (b) CCITT run-length coding
  (c) Entropy, Bell Labs algorithm: 1-D run-length coding of ordering with ref. to prev. line
  (d) Entropy, Bell Labs algorithm: 1-D run-length coding of 2-state ordered pred. err.
  (e) Entropy, Bell Labs algorithm: 1-D run-length coding of 16-state predicted errors
  (f) Entropy, Bell Labs algorithm: 1-D run-length coding of 16-state ordered pred. err.
  (g) IBM Dual Mode, K = 4
  (h) IBM Dual Mode, K = ∞

Image name         Size (lines x pixels)   (a)     (b)     (c)     (d)     (e)     (f)     (g)     (h)
WJM (typed)        2550 x 3304             0.069   0.090   0.058   0.052   0.062   0.046   0.060   0.046
GWU (typed)        2550 x 3304             0.088   0.097   0.065   0.058   0.060   0.051   0.068   0.051
BUNYAN (written)   2550 x 3304             0.052   0.072   0.045   0.039   0.041   0.026   0.042   0.027
FORM               2550 x 3304             0.147   0.198   0.075   0.068   0.073   0.068   0.104   0.068
Average b/p                                0.069   0.114   0.061   0.054   0.059   0.048   0.068   0.048
Average compression factor                 -       8.6     16.4    18.5    16.4    20.8    14.7    20.?



Table 6. Baseline Performance: 300 Points/Inch, Horizontal Scan.

Algorithm performance in bits/pixel (b/p). Column key:
  (a) 1-D run-length entropy
  (b) CCITT run-length coding
  (c) Entropy, Bell Labs algorithm: 1-D run-length coding of ordering with ref. to prev. line
  (d) Entropy, Bell Labs algorithm: 1-D run-length coding of 2-state ordered pred. err.
  (e) Entropy, Bell Labs algorithm: 1-D run-length coding of 16-state predicted errors
  (f) Entropy, Bell Labs algorithm: 1-D run-length coding of 16-state ordered pred. err.
  (g) IBM Dual Mode, K = 4
  (h) IBM Dual Mode, K = ∞

Image name         Size (lines x pixels)   (a)     (b)     (c)     (d)     (e)     (f)     (g)     (h)
WJM (typed)        3304 x 2550             0.078   0.103   0.057   0.051   0.061   0.046   0.064   0.045
GWU (typed)        3304 x 2550             0.077   0.097   0.059   0.053   0.062   0.050   0.063   0.046
BUNYAN (written)   3304 x 2550             0.040   0.054   0.044   0.040   0.039   0.031   0.041   0.030
FORM               3304 x 2550             0.102   0.121   0.065   0.057   0.065   0.054   0.075   0.052
Average b/p                                0.074   0.094   0.056   0.050   0.057   0.045   0.061   0.043
Average compression factor                 -       10.6    17.8    20.0    17.5    22.2    16.4    23.3

Fig. 6. Histogram Update Coding. 


Markov/Histogram Update/JPL Coding 

The Histogram Update/JPL Coding was applied to the output of a Bell Labs Markov 
predictor (Fig. 7), yielding performance within 10% of the two-dimensional prediction 
entropy. 


Directed Markov Prediction 

We also investigated an extension of the Bell Labs approach called Directed Markov
Prediction (Fig. 8). In general this prediction technique reduced two-dimensional
entropies by 10 to 15% and improved performance correspondingly. This approach is
discussed further in Appendix B.

Segmentation Coding 

Segmentation coding is illustrated in Fig. 9, which shows a small region of a docu- 
ment containing the word "where". Segmentation representation of a document con- 
sists of: 

a) Identifying such regions of text or non-text, 

b) Further partitioning character strings into individual regions containing 
single characters, 

c) Coding information defining the size and location of each region, 

d) Coding the interior of each region (called a prototype bit pattern) using 
modified versions of the predictive coding algorithms just discussed. 

Fig. 7. Markov Predictor/Update Code. 

Fig. 8. Directed Markov/Update Code. 

Fig. 9. Segmentation Coding. 

• The major observation here is that the performance of this approach 
is equivalent to the best two-dimensional predictive techniques that 
don't segment. (12) 

• A secondary observation to be used later is that the interior of 
regions containing 200 p/inch characters can be typically coded 
with an average of 135 bits/char. (13) 


MAJOR OBSERVATIONS 

If we supplement the latter investigations with additional runs using the CCITT 
standard test set, we can make the summary observations given below. 

One-Dimensional Techniques 

• Entropies 

• Vertical scan entropy is 20% more than horizontal. 

• Entropy at 300 p/inch is 1.74 times the 200 p/inch entropy (measured in total bits per page). 

• CCITT standard at 200 p/inch 

• Coding is 4-25% above entropy. 

• Compression factors range from 5:1 to 16:1. 

• The average compression factor is 10:1. 

Two-Dimensional Techniques 

• All sophisticated techniques are quite similar in performance. This includes 
segmentation coding (see (12) and (13)). 

• The performance difference between vertical and horizontal scan decreases 
to 5%. 

• Compression factors range from 7:1 to 35:1. 

• The average compression factor is 20:1. This is a 2:1 gain over the 1-D 
CCITT Standard. 

• These techniques cannot provide a 2:1 improvement on dense text (only 
33%). 
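
The compression factors quoted here and in Tables 3-6 follow from the bits/pixel figures, since an uncoded binary image costs one bit per pixel. A minimal Python sketch of that conversion is given below; the function names are illustrative assumptions, and the inputs are the average b/p values from Table 6.

```python
def compression_factor(bits_per_pixel):
    """Uncoded binary imagery costs 1 bit/pixel, so the factor is the reciprocal."""
    return 1.0 / bits_per_pixel

def relative_gain(baseline_bpp, improved_bpp):
    """How many times fewer bits the improved technique needs."""
    return baseline_bpp / improved_bpp

ccitt_300_horizontal = 0.094   # Table 6, CCITT run-length coding, average b/p
best_2d_300_horizontal = 0.043 # Table 6, IBM Dual Mode K = infinity, average b/p

cf_ccitt = round(compression_factor(ccitt_300_horizontal), 1)
cf_2d = round(compression_factor(best_2d_300_horizontal), 1)
gain = relative_gain(ccitt_300_horizontal, best_2d_300_horizontal)
print(cf_ccitt, cf_2d, round(gain, 2))
```

On these four Postal Service images the two-dimensional gain over the CCITT 1-D code is a little over 2:1, in line with the average stated above.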




III. A UNIVERSAL SYSTEM FOR EFFICIENT ELECTRONIC MAIL (USEEM) 

The removal of a constraint that binary facsimile images must be reconstructed
precisely leads to the possibility of the substantial performance advantages noted in (4) and
(5) and the cost savings noted in Tables 1 and 2. This potential resides in a concept
called a "Universal System for Efficient Electronic Mail (USEEM)" which achieves these
notable results primarily from the use of unsupervised character recognition and the
adaptive noiseless coding of text. It represents an extension of similar work pioneered
by W. Pratt, P. Capitant, W. Chen, E. Hamilton and R. Wallis of Compression Labs,
Inc. [10]. Some of their original algorithms were used intact whereas others were
substantially modified or replaced to achieve improved characteristics.

This section first provides an introduction and summary of the USEEM concept. Fur- 
ther details on performance, segmentation, character recognition and the coding of text 
follow. 


INTRODUCTION/SUMMARY 

Functional Capability 

USEEM can be described as a modular, six-level system concept where each higher
numbered level provides some further capability to communicate electronic mail. It is
expected that each such incremental supplement to functional capability will correspond
to an implementation increment as well. That is, a lower level may be implemented with
the expectation that the advantages of a higher level may be added later without
requiring complete redesign.

Figure 10 illustrates the modular form of USEEM. The primary data flow for the 
added capability at each step is indicated by heavy arrows. 

The first level of USEEM, labelled Predictive Coding, refers to all the standard one-
and two-dimensional techniques for noiselessly coding (bit-by-bit, exact reconstruction)
scanned and digitized binary images. These include the familiar techniques of run-length
coding, Markov state prediction, etc., as discussed in Section II. When averaged over all
forms of potential electronic mail, the best of these techniques can be expected to
reduce the bit rate requirements by about 20:1, compared to the uncoded binary output of
a Postal Service scanner. However, this advantage drops to about 7:1 when dense text
is considered.

Fig. 10. Six Levels of USEEM. 

USEEM Level 2 is not expected to provide performance improvements in the ability 
to noiselessly code electronic mail. However, it will provide a functionally crucial capa- 
bility to enable the significant performance advantages of higher USEEM levels. 

USEEM Level 2 (Segmentation Coding, Section II) achieves essentially the same
performance as Level 1 by first partitioning a document into all white regions and
regions containing black and white transitions. Since the vast majority of electronic mail
will contain principally typewritten text, the primary consequence of this
"segmentation" procedure is to place rectangular blocks around characters. USEEM Level 2 then
identifies the location of each segmented region (usually a character) and sends a coded
message identifying its contents. The latter coding procedure may be derived from the
collection of predictive techniques for a Level 1 USEEM.

USEEM Level 3 is really a consequence of the principal innovation of Level 4. It 
offers the possibility for some improvement in performing noiseless coding of electronic 
mail but as yet has not been investigated enough to allow an accurate estimate of this 
potential. 

USEEM Level 4 clearly offers the potential for a significant improvement in the abil- 
ity to communicate and store electronic mail. To achieve these gains, we must replace 
the restrictive fidelity criterion of "bit-by-bit exact reconstruction" by the looser, but de- 
manding, criterion, namely, that the user be satisfied with the product he gets. This 
simply returns to the basic system requirements for electronic mail and, as will later be 
illustrated, can actually result in better fidelity. 

USEEM Level 4 initiates a "prototype" library by storing the bit patterns of "new"
incoming segmented characters. An incoming character is new if it does not look like
any prototype already stored in the library. The bit pattern for a new character is
communicated with the predictive coding techniques noted for USEEM Level 2. However,
when an incoming character has been perceived to resemble a prototype in the library,
only an identifier of which library character has been observed is needed. A decoder will
output a replica of the corresponding prototype bit pattern from its own library when it
receives this identifier. The effect on performance can be quite significant since a
prototype character bit pattern may require an average of 135 bits, whereas a library
identifier needs only 8. JPL's initial results indicated that greater than 90% of a document's
characters could be expected to be communicated using a library identifier. On the most
difficult form of document, one containing dense text, this yields overall compression
factors of 45:1 to 70:1 versus only 7:1 by the best facsimile techniques (USEEM Levels
1 and 2).†


Character recognition. The criterion for judging an incoming character alike or unlike
a prototype library character must almost totally prevent a character, say an "a," from
being interpreted as another form, say a "c." A USEEM Level 4 would replace the "a"
by a "c" in this case with unacceptable results. Replacing one "a" with another (of the
same font) has little impact. The character recognition techniques developed so far for
USEEM have sought to maximize the occurrence of replaceable characters while
minimizing the occurrence of such errors. Although tested on only a limited data set, the
results fully support these objectives and the Level 4 performance projections noted
above. Testing was performed on several medium density documents containing 1590
segmented characters per page. By the first page, 95 percent of the characters were
correctly identified as replaceable by a library prototype. Most of the remaining 5
percent represented first-time library entries because the library was initially empty. By the
second page almost 99% of the characters were correctly identified as replaceable by
library characters. The latter percentage corresponds to compression factors exceeding
200:1.
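
The library mechanism of Level 4 can be sketched as below. This Python fragment is only an illustration: the text does not specify the recognition criterion at this point, so the pixel-mismatch threshold used here is a stand-in assumption, as are the function names, the message format, and the tiny 3 x 3 bit patterns.

```python
def mismatches(a, b):
    """Number of differing pixels between two equal-sized bit patterns."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def same_size(a, b):
    return len(a) == len(b) and all(len(x) == len(y) for x, y in zip(a, b))

def encode_characters(patterns, max_mismatches=1):
    """Send a new prototype, or only the identifier of a matching library entry."""
    library, messages = [], []
    for pattern in patterns:
        match = next((i for i, proto in enumerate(library)
                      if same_size(pattern, proto)
                      and mismatches(pattern, proto) <= max_mismatches), None)
        if match is None:
            library.append(pattern)              # first occurrence: store and send pattern
            messages.append(("new", pattern))
        else:
            messages.append(("id", match))       # repeat: send library identifier only
    return messages

def decode_characters(messages):
    """The decoder grows an identical library and outputs prototype replicas."""
    library, output = [], []
    for kind, payload in messages:
        if kind == "new":
            library.append(payload)
            output.append(payload)
        else:
            output.append(library[payload])
    return output

A = ("010", "111", "101")
A_NOISY = ("010", "111", "100")    # same character with one pixel of scanner noise
B = ("110", "101", "110")

messages = encode_characters([A, B, A_NOISY, A, B])
output = decode_characters(messages)
print([kind for kind, _ in messages])
```

Note that the noisy "a" is reproduced as the stored prototype rather than bit-for-bit, which is exactly the relaxation of the fidelity criterion described for Level 4; a threshold set too loosely would produce the "a" for "c" substitutions the text warns against.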

USEEM Level 5 seeks to capitalize on the redundancy existing in language by
applying noiseless coding to characters and words resulting from Level 4 operations. This can
directly reduce the bit cost of prototype library identifiers 2 to 3 times, as evidenced by
initial simulations. In combination with an improvement in the repeat fraction for dense
text to 0.95, the noiseless coding of text could improve overall compression factors to
100:1. For medium-density text, compression factors of over 200:1 on the first page of
a document and over 400:1 on subsequent pages could be expected. All figures
represent roughly an order of magnitude improvement over the best two-dimensional
(noiseless) facsimile techniques (USEEM Levels 1 and 2).

By appropriate segmentation of characters the strings of characters exiting a
USEEM Level 4 processor will look much like strings of characters exiting typical
automated office equipment. Consequently, the efficient character coding algorithms
developed for Level 5 should apply to most forms of electronic mail entered directly from
terminals. USEEM Level 6 is merely a statement of this observation. A 2-to-3:1 reduction
in the bits needed to communicate this nonscanned form of input should be possible.


†Note that a USEEM Level 4 can still functionally execute the noiseless coding
requirements of Levels 1 and 2.

Quality 

USEEM Levels 1-3 reduce the number of bits required to represent incoming data
while yielding exact bit-by-bit reconstruction of that data. An exact replica of the binary
image produced by a Postal Service scanning device will be reconstructed as a final
product. The constraint which requires exact reconstruction is a benevolent attempt to
ensure that final product; unfortunately it may result in mediocre reconstructed quality in
practice. This point is illustrated in the photograph in Fig. 11 showing two sequences of
the lower case letter "a."

A close inspection will reveal that the sequence of ten a's at the top is clearly of
better quality than those at the bottom. Further, the better set at the top can be
communicated at one-fifth the data rate of the inferior set at the bottom. A similar
statement could be made for storage requirements. The "exact reconstruction"
limitation would eliminate this preferable option of better quality along with significant
reductions in rate requirements since it would preclude USEEM Levels 4-6.

Actually, the inferior quality set consists of ten lower case a's taken from a document
scanned at 200 points/inch and reproduced "exactly," bit-by-bit. The
improved set of a's simulates the application of USEEM Level 5 to documents scanned at
300 points/inch. The last nine a's are replicas of the first, simulating the use of a proto- 
type character library as discussed earlier. The reduction in data rate requirements by a 
factor of five takes into account both the increased number of pixels at the higher 
resolution and the significant improvement in compression factors from USEEM Level 5. 
To achieve 300 points/inch quality under an "exact reconstruction" criterion would, by 
comparison, require increasing the data rate and buffering requirements by almost a fac- 
tor of two. 

Added advantages. An indirect consequence of the character segmentation process 
(USEEM Level 2) is actually an improvement in text readability at a given scan resolu- 
tion. The segmentation process can be used to straighten slightly crooked lines and to 
separate "run-together" characters.

USEEM PERFORMANCE ANALYSIS


Origin of the Bit Costs 

Consider the three components of data emanating from a USEEM library shown in 
Fig. 12. 


[Fig. 12 diagram: N segmented characters enter the USEEM library. A fraction (1 - γ) are
NEW prototypes at 135 bits/char, costing N(1 - γ)(135) bits. A fraction γ are OLD
characters at 8/λ bits/char, costing N(γ)(8/λ) bits. The segmentation cost is Nβ bits.]

λ = TEXT COMPRESSION FACTOR
β = SEGMENTATION COST IN BITS/CHAR
γ = FRACTION OF INCOMING CHARACTERS REPLACEABLE BY LIBRARY CHARACTERS, 0 < γ < 1

Fig. 12. Bit Costs from Library.


Identical library identification systems reside at the sending and receiving sites.
Each system contains a library of prototype characters which (for now) is assumed
empty at the start of a document. Characters entering the identification system are
tested against the library prototypes to determine whether they are "OLD" or "NEW".
An "OLD" character looks enough like a library character to be represented by it. In this
case only a library identifier need be communicated to notify the receiving end that it can
reconstruct the specific library prototype. A 256-character library needs only
8 bits/character if a fixed-length code is used. Use of noiseless text compression by a
factor λ would reduce this to


8/λ bits/character (14)

A "NEW" character does not look like one of the existing prototypes in the library 
and must be communicated as a bit pattern. Using modified predictive facsimile com- 
pression algorithms from Section II, 200 points/inch characters can typically be repre- 
sented with 


135 bits/character (15)

including cost for identifying the size of the character. 

A "NEW" character is also added to the prototype library (replacing the oldest entry 
if necessary) at both the sending and receiving sites. 

The parameter


γ, 0 < γ < 1 (16)

is used to specify the fraction of incoming characters which can be communicated as
library identifiers (OLD). Achieving a high value of γ will soon materialize as the key to
high performance.

An additional "segmentation cost" of

β bits/character (17)

is associated with identifying the location of each character bit pattern.

Then from Fig. 12, N segmented characters will require

$$N\left[\beta + (135)(1-\gamma) + \frac{8}{\lambda}\,\gamma\right] \text{ bits} \qquad (18)$$

to code.




Text density. 4000 characters on a standard 8-1/2 x 11" page is extremely dense
text, particularly at a scan density of 200 points/inch. 1500-2000 characters on a page
can be considered medium-density text and 800 characters/page light text. Such
densities can more generally be expressed in characters/pixel by

$$D_T = \frac{\text{Number of characters}}{\text{Number of pixels in area containing characters}} \qquad (19)$$

For example, D_T for dense text is given by 4000/(3.7 x 10^6) ≈ 0.001.

Text performance equation. Substituting (19) in (18) we obtain the performance
equation for text regions given by

$$\frac{1}{\omega_T} = R_T = D_T\left[\beta + (135)(1-\gamma) + \frac{8\gamma}{\lambda}\right] \text{ bits/pixel} \qquad (20)$$

where we recognize ω_T as the text compression factor. ω will similarly be used in
subsequent equations.
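The text performance equation lends itself to a quick numerical check. The following is a minimal sketch in Python, not part of the original study: the function names and the `PAGE_PIXELS` constant (an 8-1/2 x 11" page at 200 points/inch, taken as 1700 x 2200 pixels) are assumptions introduced for illustration.

```python
# Hypothetical helpers illustrating Eq. (20); names are not from the report.
PAGE_PIXELS = 1700 * 2200  # assumed: 8.5 x 11 inch page scanned at 200 points/inch


def text_rate(d_t, beta, gamma, lam, new_bits=135.0, id_bits=8.0):
    """Bits/pixel for a text region: D_T * [beta + 135(1 - gamma) + 8*gamma/lambda]."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if d_t <= 0 or lam <= 0:
        raise ValueError("density and text coding factor must be positive")
    bits_per_char = beta + new_bits * (1.0 - gamma) + id_bits * gamma / lam
    return d_t * bits_per_char


def text_compression_factor(d_t, beta, gamma, lam):
    """Text compression factor omega_T, the reciprocal of the rate R_T."""
    return 1.0 / text_rate(d_t, beta, gamma, lam)


# Example: medium-density text (1500 characters/page), text segmentation
# (beta = 0), no text coding (lambda = 1), repeat fraction gamma = 0.945.
omega_medium = text_compression_factor(1500 / PAGE_PIXELS, 0.0, 0.945, 1.0)
```

Under these assumptions the example evaluates to a compression factor in the mid 160s, in line with the measured first-page figure reported for the first word processor letter in Table 7.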


USEEM performance equation. Non-text regions are coded by the two-dimensional
techniques of Section II. We denote the corresponding rate for non-text regions by

$$\frac{1}{\omega_{NT}} = R_{NT} \text{ bits/pixel.} \qquad (21)$$

Then letting

$$\alpha \qquad (22)$$

denote the fraction of a document which is non-text we get the USEEM performance
equation†

$$\frac{1}{\omega_U} = R_U = \alpha R_{NT} + (1-\alpha) R_T \qquad (23)$$

where R_T is given by (20).
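The weighting of text and non-text rates in the USEEM performance equation can be sketched as follows. This is an illustrative Python fragment, not part of the original study; the function name `useem_compression_factor` is hypothetical.

```python
def useem_compression_factor(alpha, omega_nt, omega_t):
    """Overall factor omega_U from 1/omega_U = alpha/omega_NT + (1 - alpha)/omega_T.

    alpha    -- fraction of the document that is non-text (0..1)
    omega_nt -- facsimile compression factor for non-text regions
    omega_t  -- text compression factor for text regions
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if omega_nt <= 0 or omega_t <= 0:
        raise ValueError("compression factors must be positive")
    rate = alpha / omega_nt + (1.0 - alpha) / omega_t  # bits/pixel, R_U
    return 1.0 / rate


# Example: half the document is non-text compressed 7:1, the other half
# is light text compressed 1000:1.
omega_mixed = useem_compression_factor(0.5, 7.0, 1000.0)
```

Because the rates (not the compression factors) are averaged, the poorly compressed region dominates: the example yields an overall factor of roughly 14:1 even though half the page compresses 1000:1.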


† Observe, we have left out the situation where some regions of text are incorrectly
treated as non-text. However, such text would then be coded by 2-D facsimile with
roughly the same results as being treated as text and being coded as "NEW" prototype
patterns (see Section II).
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Graphical Analysis 


We will now use these equations to graphically investigate the impact of the
parameters α, D_T, β, γ, and λ on USEEM compression factors.

• α: Changing α varies the ratio of text to non-text in a document. When α = 0, a
document consists of text only.

• ω_NT: When considering USEEM non-text regions we will assume the two cases
where facsimile compression factors are ω_NT = 7:1 (dense) and ω_NT = 20:1
(medium).

• D_T: Text density values will correspond to dense, medium and light text, as
defined above.

• β: We will consider the two values 10 and 0 for the cost of segmenting
characters. The former (β = 10 bits) is a requirement of a Compression Labs,
Inc. process called here "character segmentation", and the latter (β = 0
bits) is the result of a procedure called "text segmentation". Both of these will
be briefly discussed later.

• γ: This parameter, perhaps the most important of all, specifies the fraction of
text characters which can be represented by, and hence communicated as, an
existing library character. γ will appear in all graphs as the abscissa, varying
between 0 and 1.

• λ: A USEEM prototype library containing 256 elements requires 8/λ bits/character
to communicate an identifier, where λ is the reduction factor obtained by
noiseless coding of text. We will use the values of 1 (no text compression) and
2, the latter supported for dense text by preliminary simulations.




Dense text only. Figure 13 displays plots of text compression factor ω_T versus γ
under the conditions of dense text only (corresponding to the CCITT standard test image
No. 4 in Fig. 2).

Included for comparison is a horizontal line ω_T = 7 labelled "facsimile". This
represents the expected performance of 2-D predictive facsimile techniques.

The dashed curve represents a basic USEEM configuration using the simpler "character
segmentation" approach (β = 10) and no text coding (λ = 1). Even in this case
γ = 0.9 yields a compression factor of 30:1 compared to only 7:1 by facsimile. At the
bottom end, where γ = 0, the segmentation overhead reduces the compression factor to
slightly less than facsimile. However, γ need only be 0.08 to yield equal performance.

The next curve shows the advantage of a text segmentation approach (β = 0). At
the low end (γ = 0) text segmentation raises performance slightly (to an equivalence
with facsimile). The advantage of text segmentation increases rapidly at higher values
of γ, where at γ = 1 the (potential) advantage is over 2:1. In absolute terms USEEM text
segmentation provides a 46:1 compression factor at γ = 0.9 and 110:1 at γ = 1.0.

The impact of adding noiseless text coding (with λ = 2) is shown by the next curve.
Improvements to USEEM compression factors are negligible until γ = 0.7, achieving a
maximum gain equal to λ = 2 when γ = 1.0. In absolute terms, the combination of text
segmentation and text coding yields overall compression factors of 55:1 when γ = 0.9
and 220:1 when γ = 1.0. The latter figure represents performance 30 times better than
facsimile.

Effects of text density. Figure 14 shows the effect of text density on the potential
compression factor of a USEEM system equipped with text segmentation (β = 0) and
text coding (λ = 2). The lower curve for dense text has been transferred from Fig. 13.

Observe that in all cases, USEEM performance equals facsimile performance for
γ = 0. At the other end, limiting performance for γ → 1 reaches almost 600:1 for
medium-density text and over 1000:1 for light text.

Fig. 13. USEEM Compression Factor Estimates: Dense Text. (Abscissa: fraction of old characters, γ.)

Fig. 14. Effect of Text Density. (Abscissa: fraction of old characters, γ, from 0 to 1.0.)



Variations in α. The impact of varying percentages of text and non-text regions
within a document is shown for dense text in Fig. 15, medium text in Fig. 16, and light
text in Fig. 17.

In each figure the document percentage of text has been varied by letting†

α = 0.0, 0.2, 0.6 and 1.0 (24)

for

ω_NT = 7 and 20 (25)

Compression factors for text regions assume that both text segmentation (β = 0)
and text coding (λ = 2) have been incorporated.

We have just investigated the case where α = 0.0 such that a document is all text.
Each of these corresponding curves for dense, medium, and light text has been
transferred to Figs. 15, 16, and 17, respectively. They appear as the uppermost curve
(for high values of γ) in each case.

By Eq. 23, when α equals 1 a document is considered all non-text and compression
factors equal the ω_NT factor regardless of γ. This case then appears as horizontal lines
at ω_NT = 7 and ω_NT = 20.

Observe that a high percentage of non-text can significantly alter the
overall compression factor. The worst-case situation occurs when the non-text regions
can only be compressed by ω_NT = 7 and the text itself is light (Fig. 17). For example, a
potential compression factor of 1000:1 (γ = 1) is reduced by almost two orders of
magnitude to 13:1 when half the document is non-text (40:1 if ω_NT = 20).


† The non-text contributions to these graphs can also be viewed as text regions which
have been incorrectly segmented.

Simulation Results

Table 7 shows the results of actual USEEM performance runs on binary images con- 
taining only text. These include the extremely dense CCITT French document No. 4 in 
Fig. 2 and four medium-density letters generated by a word processor and then scanned 
at 200 points/inch. 

The repeat fraction γ in Table 7 is the same parameter as in Figs. 13-17 except that
these are actual results. First page fractions include the fact that the USEEM library
starts out empty. Second page results assume an established library.

The "errors" indicated can be viewed as minor errors which would not be
misinterpreted by a reader. Most of the CCITT errors were due to the special symbols of the
French language which were not accounted for in the initial pattern recognition 
development. 


Table 7. Actual USEEM Performance.

                                               WORD PROCESSOR GENERATED
IMAGE                          CCITT NO. 4     1         2         3         4

CHARACTERS                     [illegible]     1500      1500      1500      1500
                                               (MEDIUM)

REPEAT FRACTION γ
  1st PAGE                     0.927           0.945     0.954     0.954     0.933
  2nd PAGE                     0.943           0.980     0.988     0.988     0.972

COMPRESSION FACTOR ω
(NO TEXT CODING)
  1st PAGE                     53              166       182       182       150
  2nd PAGE                     70              232       257       257       212

POTENTIAL ω WITH TEXT CODING
  1st PAGE                     68              222       252       252       195
  2nd PAGE                     90              370       439       439       319

The dense document started at γ ≈ 0.93 on the first page and would increase to
γ ≈ 0.94 if a second page were presented. Compression factors (actual) were 53 and
70 for first and second page respectively. Estimates of additional gains possible from
text coding (λ = 2) for first and second page respectively are shown.

First-page γ values near 0.95 were produced on the four medium-density
documents. By the second page, values close to γ = 0.99 were achieved. These yielded
actual compression factors of from 150:1 to 250:1.

Adding text coding in these cases offers the potential for a range from 200:1 to
400:1 (λ = 2).


SEGMENTATION 


The basic approach to the segmentation of textual material is to isolate individual
characters and independently identify their locations. We call this approach character
segmentation. If the context of a typed page is accounted for so that individual
characters belong to "text lines," then significant reductions in identification overhead (β in
the previous section) can be achieved. The output of this text segmentation process
can, in fact, be made compatible with word processor formats, bringing scanned
electronic mail much closer to direct entry electronic mail systems.


Character Segmentation 

The basic character segmentation approach of Compression Labs, Inc.(10) is
illustrated in Fig. 18 and uses the sequence of text "then. It ..." to illustrate that the
prototype storage and processing of characters is in order of a) the scan line that first touches
a character top (i.e., the tallest on a level line) and b) left to right.

The horizontal position of each individual character must be communicated.
Position is determined by the highest point on a character, called a Keypoint, as shown in
Fig. 19.† Unfortunately, the communication of character Keypoints in a somewhat
random fashion contributes significantly to an unnecessarily high character segmentation
cost of β = 10 bits/character. The fact that adjacent characters have adjacent Keypoints is
essentially discarded by this approach.


† Height and width information of a character is assumed to be part of a prototype
definition.

Fig. 18. Character Segmentation Ordering.

Fig. 19. Character Segmentation Keypoint.


Text Segmentation 

Handling spaces. A normal sequence of characters making up words consists of the
regular alphabetical characters separated by small spaces of one or more pixels. A
sequence of words consists of the words separated by larger spaces originally generated
as "blank characters".

If one simply assumes that any character is followed by a character space then such
spaces can be inserted as part of the reconstruction process without increasing
communication cost. The longer spaces which originated as blank characters can be treated
by defining a "blank" prototype in the USEEM library.

Figure 20 shows a sequence of regular characters, character spaces, and a single
blank character. Since any character is followed by a character space, the blank
character has a character space on either side. We define the length of a character space to be:

s = Average of the minimum observed space (in pixels) between regular characters. (26)

Now assume that the minimum observed blank area, L, greater than 3s must be a "blank
character" and two character spaces. Then the length of a "prototype" blank character
is

$$B = L - 2s \qquad (27)$$

Both B and s can be determined quickly at the start of a document if the font is
unknown.


The number of blank prototypes in an observed blank area of length L' is given as†

$$n_B = \left\lceil \frac{L' - s}{B + s} \right\rceil \qquad (28)$$

provided L' > 3s. This defines the prototype "blank" character.
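The blank-handling rules of (27) and (28) reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and returning zero blanks for gaps of 3s or less is an assumption (the source only states the formula for L' > 3s).

```python
import math


def blank_prototype_length(min_blank_area, s):
    """Eq. (27): B = L - 2s, where L is the smallest observed blank area > 3s."""
    if min_blank_area <= 3 * s:
        raise ValueError("L must exceed 3s to contain a blank character")
    return min_blank_area - 2 * s


def count_blank_prototypes(observed_area, blank_len, s):
    """Eq. (28): n_B = ceil((L' - s) / (B + s)) for an observed blank area L'."""
    if observed_area <= 3 * s:
        return 0  # assumed: treated as an ordinary character space, no blank sent
    return math.ceil((observed_area - s) / (blank_len + s))


# Example with a 2-pixel character space and a 12-pixel minimum blank area.
B = blank_prototype_length(12, 2)          # prototype blank is 8 pixels wide
n_single = count_blank_prototypes(12, B, 2)
n_double = count_blank_prototypes(22, B, 2)
```

The formula follows from the layout in Fig. 20: n blanks occupy n(B + s) pixels plus the one character space that trails the preceding regular character.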


The initial reconstruction process for a string of text would insert B + s pixels of
blank area for each prototype blank character transmitted. Character spaces of basic
widths would follow each regular character. Any adjustment needed to fit the text
precisely within a prescribed area (e.g., a line) could be accommodated by lengthening
or shortening the actual reconstructed character spaces as needed (spread out across
the area).


There is nothing magic about s except that its use will tend to place reconstructed
text over the same area. In fact, reconstruction could include proportional spacing, even
if it didn't exist in the original.


† ⌈x⌉ is the smallest integer greater than or equal to x.

Fig. 20. Spaces in Text.


Text baseline. Text is intended to run along horizontal lines; therefore, it can be 
assumed that some form of malfunction has occurred when this is not the case. In addi- 
tion, the base of most characters sits on the same horizontal line (i.e., a, b, c but not y, j, 
g). We define the latter horizontal line as the text baseline relative to the alphabet in use. 

When a character prototype is first entered into the USEEM library, it is stored with 
information relating the position of the prototype relative to the currently assumed "text 
baseline" (see Fig. 21 ). 

A prototype library is initialized only when a text baseline can be unequivocally 
determined as the scan line which has the most prototype baselines. Henceforth, text 
baselines in a local area (even tilted lines) can be determined from repeated prototypes. 
Conversely, the position of a prototype relative to a baseline can be used as a screening 
feature. 

Example. Figures 22 and 23 illustrate the advantage of text segmentation. Fig. 22
is a copy of a letter in original form whereas Fig. 23 is the same letter after text
segmentation. A close look will reveal that the original has numerous letters which have been
run together (e.g., the m's in communication). This problem has been cleared up in
Fig. 23 because the characters have been uniformly spaced as described in (26)-(28).
Additionally, as already noted, the process of text segmentation would allow the
straightening of crooked lines.

[Fig. 21 diagram: the text baseline is the scan line that the majority of characters
sit on (e.g., a, b, c, not y, j, g). The position of the bottom of a character prototype
(the prototype baseline) relative to the text baseline is part of the prototype information.]

Fig. 21. Text Baseline.

Fig. 22. Original Cross Letter. Fig. 23. Text Segmented Cross Letter.

Figure 24 illustrates the basics of unsupervised character recognition. Input
segmented bit patterns P_in are compared to "prototype bit patterns" P_j, already stored in a
library, by computing a "distance" between them. This distance measure D_2(·,·) as
used here is a modified Exclusive-OR procedure developed by Compression Labs Inc.
called Template Matching(10). The smallest D_2 value

$$D = \min_j D_2(P_{in}, P_j) \qquad (29)$$

is computed and if it is small enough

$$D < T_2 \qquad (30)$$

a match with the corresponding P_j is determined.



Fig. 24. Basic Character Recognition.

If a match is determined, only a library identifier for P_j need be communicated. If
there is no match, then P_in becomes a new library prototype and must be communicated
separately as a bit pattern (to be added to an equivalent library at the alternate site).
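The OLD/NEW decision described above can be sketched in Python. This is a simplified illustration, not the Compression Labs Template Matching procedure: a plain pixel-wise Exclusive-OR count stands in for the modified distance D_2, and the class and method names are hypothetical. Replacement of the oldest entry when the library is full follows the rule stated earlier for "NEW" characters.

```python
def xor_distance(a, b):
    """Number of differing pixels between two binary patterns (tuples of row tuples).

    Assumption: patterns of different size are treated as infinitely far apart.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return float("inf")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


class PrototypeLibrary:
    """Fixed-size prototype library; the same logic would run at both sites."""

    def __init__(self, capacity=256, threshold=3):
        self.capacity = capacity
        self.threshold = threshold       # plays the role of T_2
        self.slots = [None] * capacity   # identifier -> prototype bit pattern
        self.next_slot = 0               # slots are filled and replaced oldest first

    def classify(self, pattern):
        """Return ("OLD", identifier) on a match, else store and return ("NEW", identifier)."""
        best_id, best_d = None, float("inf")
        for ident, proto in enumerate(self.slots):
            if proto is None:
                continue
            d = xor_distance(pattern, proto)
            if d < best_d:
                best_id, best_d = ident, d
        if best_d < self.threshold:      # D < T_2: send the identifier only
            return ("OLD", best_id)
        ident = self.next_slot           # no match: pattern becomes a new prototype
        self.slots[ident] = pattern
        self.next_slot = (ident + 1) % self.capacity
        return ("NEW", ident)
```

Because sender and receiver apply the same deterministic rule to the same sequence of patterns, their libraries stay identical without any extra synchronization traffic.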

Threshold T_2. The basic reason for threshold T_2 is illustrated in Fig. 25, which
displays several conditional probability distributions of the distance measure D_2. Each
curve represents the distribution of D_2 comparing the patterns of a specific input
character "C" and the library patterns corresponding to specific characters. It should come
as no surprise that the closest library pattern to an input "C" is another "C" as shown.
However, an "O" is not too far off, so that there is some overlap in the distributions.
Deciding a match has occurred based only on choosing the minimum D_2 could then lead
to errors since, as illustrated, there is some chance that a library "O" may be closer to
an input "C" than a library "C". In fact, a library "D" may be still closer. The solution is
to provide a threshold T_2 = X which avoids making such errors.

Thresholding in this manner, while preventing errors, can drastically reduce the
chances of a match. For example, setting a threshold T_2 = X in Fig. 25 means that the
chances of matching an input "C" with a library "C" are represented by the area in the
crosshatched region. We will return to this problem in subsequent paragraphs.



Fig. 25. Basic Thresholding. 


Basic Prototype Screener 


Performing the matching operations in Fig. 24 to find the minimum "D" would re- 
quire that each member of the prototype library be compared. This would incur an exor- 
bitant and unnecessary computational load. To avoid matching against each pattern 
Compression Labs Inc. implemented the basic "Prototype Screener" shown in Fig. 26. 

The function of such a screener, a standard technique in pattern recognition, is to
avoid the matching operations on those prototypes which aren't even close. This is
accomplished by extracting and maintaining a feature vector F_j with each prototype. The
corresponding feature vector, F_in, of incoming patterns is compared to each F_j using
distance measure D_1(·,·). If

$$D_1(F_{in}, F_j) < T_1 \qquad (31)$$

the corresponding prototype bit pattern, P_j, is passed on to be compared in the matcher
as described above.

While reducing matcher computation requirements, this procedure does nothing to 
improve performance since it only excludes prototypes which would never have been 
chosen by the matcher. Matching performance is the same as if all bit patterns had been 
passed through. 
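A screener of this kind can be sketched as a cheap filter in front of the matcher. The Python below is illustrative only: the source does not specify the feature vector or the distance D_1, so the features used here (height, width, black pixel count) and the city-block distance are assumptions, as are the function names.

```python
def features(pattern):
    """Hypothetical feature vector: (height, width, number of black pixels)."""
    height = len(pattern)
    width = len(pattern[0]) if height else 0
    black = sum(sum(row) for row in pattern)
    return (height, width, black)


def feature_distance(f_in, f_j):
    """Assumed D_1: city-block distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(f_in, f_j))


def screen(pattern, library, t1):
    """Return indices of prototypes whose features satisfy D_1(F_in, F_j) < T_1.

    library is a list of (feature_vector, bit_pattern) pairs, so prototype
    features are computed once when the prototype is stored.
    """
    f_in = features(pattern)
    return [j for j, (f_j, _) in enumerate(library)
            if feature_distance(f_in, f_j) < t1]
```

Only the prototypes returned by `screen` would be handed to the template matcher, which is where the computational saving comes from.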

Two-Stage Character Recognition 

A return to Fig. 25 reveals a possible way to improve performance. As shown, the
area under the C_IN vs C curve above X represents the probability that a "C" will have to
be communicated as a prototype pattern.

The threshold could be raised if C_IN did not have to be matched with a prototype
"O" or "D". This situation is illustrated in Fig. 27 where the threshold T_2 can be moved
up to X', allowing all C_IN to be correctly classified as the library prototype C.

A structure which achieves the desired situation is shown in Fig. 28. The prototype
library is treated as made up of N classes. An incoming pattern is first pre-classified as
belonging to one of these N classes. The subsequent steps of comparing feature vectors
(within the selected class) to determine which patterns should be compared in the
matcher are essentially the same as before. However:

1) A different screener distance measure D_1(·,·) and threshold T_1 can be
used for each class, and similarly;

2) Matcher threshold T_2 can be adjusted for each class.

Fig. 26. Character Recognition with Prototype Screener.

Fig. 27. Raising the Threshold, D_2.

Fig. 28. Two-Stage Character Recognition.


Pre-classification tree. The pre-classification process is accomplished by
identifying a new set of features which distinguish between characters that appear close from a
matcher point of view, for example, the "C" and "O" in Fig. 25.

The desired set of features can be placed into a decision tree as shown in Fig. 29.

As shown, a character is identified as containing one or two parts. If there is only
one part, does it contain enclosures? If not, is the character large or small? If it does
have enclosures, how many are there and where are they located relative to a "text
baseline" (see previous section)?
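One of the tree's questions, whether a character contains enclosures, can be answered with a flood fill. The sketch below is an illustration under stated assumptions, not the USEEM implementation: the function names, the 4-connected background rule, the `large_height` cutoff, and the simplified class key (which ignores enclosure position relative to the baseline) are all hypothetical.

```python
def count_enclosures(bitmap):
    """Count background regions fully surrounded by ink (1 for 'O', 0 for 'C')."""
    rows, cols = len(bitmap), len(bitmap[0])
    # Pad with a background border so all outside background forms one region.
    grid = ([[0] * (cols + 2)]
            + [[0] + list(r) + [0] for r in bitmap]
            + [[0] * (cols + 2)])
    seen = [[False] * (cols + 2) for _ in range(rows + 2)]

    def flood(r0, c0):
        # Mark the 4-connected background region containing (r0, c0).
        stack = [(r0, c0)]
        seen[r0][c0] = True
        while stack:
            r, c = stack.pop()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows + 2 and 0 <= nc < cols + 2
                        and not seen[nr][nc] and grid[nr][nc] == 0):
                    seen[nr][nc] = True
                    stack.append((nr, nc))

    flood(0, 0)  # the outside background is not an enclosure
    holes = 0
    for r in range(rows + 2):
        for c in range(cols + 2):
            if grid[r][c] == 0 and not seen[r][c]:
                holes += 1
                flood(r, c)
    return holes


def preclass_key(bitmap, n_parts, large_height=12):
    """Simplified stand-in for the decision tree: parts, enclosures, size."""
    if n_parts >= 2:
        return ("two-part",)
    holes = count_enclosures(bitmap)
    if holes == 0:
        size = "large" if len(bitmap) >= large_height else "small"
        return ("one-part", "no-enclosure", size)
    return ("one-part", "enclosures", holes)
```

With a feature like this, a "C" (no enclosure) and an "O" (one enclosure) fall into different classes before any template matching takes place.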

The result of an actual USEEM pre-classification is shown in Table 8. 


Observations. The implementation of the pre-classification tree within the frame- 
work of the structure in Fig. 28 had a dramatic impact on performance. Many characters 
which had been quite difficult to distinguish before by matching no longer posed a prob- 
lem since they were separated by the earlier steps. 

An unintentional but important bonus was an equally dramatic reduction in overall 
computation. The matcher would now have to perform an average of only two template 
matchings per incoming character. 


NOISELESS CODING OF TEXT 

Preliminary work was performed to assess the potential reduction in bits needed to
represent the string of identifiers emanating from a USEEM prototype library. By the
assumption of text segmentation, the output of such a library is essentially in word
processor form, without the benefit of information specifying which identifier corresponds
to which alphabetical character. An assessment was made by defining a fairly complex
coding structure and then using the output of an actual word processor to estimate
expected code performance. These results showed that a reduction of 2 to 3 times could
be obtained. The factor λ = 2 was used in earlier estimates projecting overall USEEM
performance if text coding was fully implemented.

Table 8. Actual Pre-classification.

Class   Characters Identified In Class
0       i j
1       f 5 / 1 7 t
2       • (
3       0 o O D A
4       R e P p 9
5       a d b
6       u y N v w Y M
7       k
8       J C E S L G
9       8 g B
10      3 2 c T F
11      n h
12      m
13      r

Observe that "C" has been separated from "O" and "D".

Since these simulations were preliminary, we will only briefly discuss the text cod- 
ing structure used in these investigations. 

Code Structure 

Figure 30 illustrates the approach. Text composed of sequences of library
identifiers (including one to indicate that a new library prototype pattern must be inserted)
enters on the left. This data stream is then split into several new data streams to be
separately coded and then concatenated to form the complete output. The function of the
latter step is to allow the various non-stationary statistical components of USEEM text to
be separately treated.

Fig. 30. Text Coding Structure.


Data is treated as making up either words or non-word groups of characters.
Non-word groups include blank characters, special infrequently occurring symbols,
end-of-line markers, etc. Sequences of words and non-word groups are accompanied by coded
data sequences identifying how many characters are in each group. The coding of word
and character sequences follows a structure much like that of the USEEM prototype
library itself. Identical word and character libraries are developed and maintained at
sending and receiving sites. The purpose of this step is to "learn" the statistical
characteristics of the language. Initial work accounted for the relative frequency of occurrence
of words and characters as well as the correlation between adjacent characters (e.g., a
'u' following a 'q').

All of the steps leading to the boxes labelled "coder 1" to "coder 7" act as
preprocessors which produce memoryless sources with symbols 0, 1, 2, ... such that

$$p_0 \ge p_1 \ge p_2 \ge \cdots \qquad (32)$$

is well approximated. This is the required condition to make effective use of the adaptive
variable length coding techniques defined in Refs. 2-4. Hence, each of the "coders" in
Fig. 30 can be a particular version of the latter algorithms.
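The ordering condition (32) can be illustrated by relabeling symbols according to how often they occur. The Python sketch below is a simplified, non-adaptive stand-in for the "learning" libraries described above: it ranks symbols from a complete sequence, whereas a real system would have to build the same ranking incrementally at both sites. The function name is hypothetical.

```python
from collections import Counter


def relabel_by_frequency(symbols):
    """Map each symbol to an integer so the most frequent symbol becomes 0.

    Returns (labels, ranking). ranking[i] is the original symbol with label i,
    so the mapping is reversible given the ranking table.
    """
    counts = Counter(symbols)
    # sorted() is stable, so ties keep their order of first appearance.
    ranking = sorted(counts, key=lambda sym: -counts[sym])
    label = {sym: i for i, sym in enumerate(ranking)}
    return [label[sym] for sym in symbols], ranking


labels, ranking = relabel_by_frequency("abracadabra")
```

After relabeling, small integers dominate the stream, which is the form the variable length coders expect.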
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SOFTWARE FOR UNIVERSAL NOISELESS CODING 


Robert F. Rice and Alan P. Schlutsmeyer


Jet Propulsion Laboratory 
California Institute of Technology 
Pasadena, California 


Abstract 

Discrete data sources arising from practical
problems are generally characterized by only
partially known and varying statistics. Practical
adaptive techniques for the efficient noiseless
coding of a broad class of such data sources have been
developed at JPL. These techniques have now been
implemented in ANSI-standard Fortran IV and made
available to researchers through NASA's Computer
Software Management and Information Center
(COSMIC).

This paper describes the software package and
the algorithms upon which it is based. These
algorithms have exhibited performance only slightly
above all entropy values when applied to real data
sources with stationary characteristics. However,
performance considerably under a measured
average data entropy may be observed when data
characteristics are changing over the measurement
span.

These easily implemented algorithms are
applicable to virtually any alphabet size arising in
practice. A subset of these results is a large class of
efficient adaptive coders for binary memoryless
sources characterized by an unknown or varying
statistic.


INTRODUCTION 

Discrete data sources arising from practical
problems are generally characterized by only
partially known and varying statistics. Earlier
papers(1,2) provided the development and
analysis of some practical adaptive techniques for the
efficient noiseless coding of a broad class of such
data sources. Specifically, these algorithms were
developed for efficiently coding discrete
memoryless sources which have known symbol probability
ordering but unknown values. A general
applicability of these algorithms to solving practical
problems is obtained because most real data
sources can be simply transformed into this form
by appropriate preprocessing.

As part of the evolution of this set of "code
operators" described in Refs. 1 and 2 a Fortran IV
software package was developed which basically
matches each code operator and inverse with a
pair of corresponding subroutines. The existence
of this software, now available through the
Computer Software Management and Information
Center (COSMIC), should ease the burden of a
potential user to determine the applicability,
performance and appropriate options of these
algorithms for his specific system problem.

The intent of this paper is to provide an
overview of the universal noiseless coding algorithms
as well as their relationship to the now available
Fortran implementations. Readers considering
investigating the utility of these algorithms for
actual applications should consult both COSMIC and
Refs. 1 and 2. Examples of applying these
techniques are given in Refs. 4-6.

Reversible Preprocessing 

Removing correlations. In real problems
where samples of a data sequence are correlated 
with themselves or with a priori information, there 
is usually some simple transformation which re- 
sults in new sequences wherein the samples are ap- 
proximately independent. More important, the un- 
certainty in what the sample values will be is usu- 
ally greatly reduced. The less uncertainty there is 
the greater the potential for reducing the average 
bits required to code. Examples of such memory 
reducing operations include taking differences 
between adjacent samples along a television line, 
successive states of a Markov source or run 
lengths from a run length coder. 
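As an illustration of a reversible, correlation-removing step, the following Python sketch takes differences between adjacent samples and reconstructs the original exactly. The function names are hypothetical, and sending the first sample unchanged is an assumption made here to keep the transform self-contained.

```python
def delta_encode(samples):
    """Replace each sample (after the first) by its difference from the previous one."""
    if not samples:
        return []
    out = [samples[0]]  # assumed: the first sample is passed through unchanged
    out.extend(b - a for a, b in zip(samples, samples[1:]))
    return out


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    total = 0
    for i, d in enumerate(deltas):
        total = d if i == 0 else total + d
        out.append(total)
    return out


line = [100, 101, 103, 103, 102, 100]   # e.g. samples along one scan line
deltas = delta_encode(line)
restored = delta_decode(deltas)
```

For slowly varying data the differences cluster near zero, so far fewer distinct values carry most of the probability than in the raw samples.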

Symbol probability ordering. Given q possible
symbols resulting from correlation removing
operations it is a simple matter to first relabel
them into the integers 0, 1, 2, ..., q-1. Then
letting $P = \{p_j\}$ be the probability distribution of
0, 1, 2, ..., q-1 we note that for a wide class of
practical problems, the probability ordering of
symbols is a priori known (or at least well
approximated). In fact for many problems this
ordering tends to change very little even as the
actual P values may be changing dramatically
(consider the independent difference samples along
a television scan line). It is then a simple matter
to relabel source symbols, if necessary, so that
the following condition is well approximated:

    $p_0 \ge p_1 \ge p_2 \ge \cdots \ge p_{q-1}$    (1)
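
The two preprocessing steps described above (removing correlation by differencing adjacent samples, then relabeling the results into the integers 0, 1, 2, ...) can be sketched as follows. This is only an illustration: the function names and the particular relabeling (0, -1, +1, -2, +2, ... mapped to 0, 1, 2, 3, 4, ...) are assumptions chosen so that small differences receive small labels, which is one plausible way to approximate condition (1) for scan-line data.

```python
def difference_preprocess(line, first_reference=0):
    """Replace each sample by its difference from the previous sample."""
    diffs = []
    prev = first_reference
    for s in line:
        diffs.append(s - prev)
        prev = s
    return diffs


def inverse_difference(diffs, first_reference=0):
    """Undo difference_preprocess, recovering the original samples."""
    line = []
    prev = first_reference
    for d in diffs:
        prev = prev + d
        line.append(prev)
    return line


def relabel(d):
    """Assumed relabeling: 0, -1, +1, -2, +2, ... -> 0, 1, 2, 3, 4, ..."""
    return 2 * d if d >= 0 else -2 * d - 1


def inverse_relabel(n):
    """Undo relabel."""
    return n // 2 if n % 2 == 0 else -(n + 1) // 2


scan_line = [10, 12, 11, 11, 15]
diffs = difference_preprocess(scan_line)
symbols = [relabel(d) for d in diffs]
restored = inverse_difference([inverse_relabel(n) for n in symbols])
```

Both steps are exactly invertible, which is the "reversible" requirement of the preprocessing stage.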

Changing P. Most real world problems are
characterized by changing and poorly defined values of
P. P may vary simply because short sequences
are the result of preprocessing different data
sources. There may be long and short term
statistical variations in a single data source. Other
than meeting condition (1), P may not be known at
all.

The modified coding problem may now be
summarized in Fig. 1. The net result of correlation
removing operations and symbol relabeling is, for
many practical problems, an approximately
memoryless source with symbols 0, 1, 2, ..., q-1 with
generally unknown and varying probability
distributions approximating condition (1). It is for the
efficient coding of such modified sources that the code
operators of Refs. 1-3 provide practical solutions.

Notations and Definitions 

Using the notation of Refs. 1 and 2, 

    $\mathscr{L}(X)$    (2)

will be used to specify the length of any sequence 
X in samples or in bits (of its standard fixed 
length binary representation if X is not already 
binary). 

Performance measures. Given the discrete
symbol probability distribution $P = \{p_j\}$ the
entropy H(P) is defined by

    $H(P) = -\sum_j p_j \log_2 p_j$ bits/sample    (3)

When properly used, H(P) can be a useful practical 
tool in assessing how well a particular coding al- 
gorithm performs. 
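
As a minimal sketch of how the entropy of equation (3) might be measured on real data, the function below estimates P from the relative frequencies of the symbols in a sample sequence. The function name is an assumption made for illustration.

```python
import math
from collections import Counter


def entropy_bits_per_sample(samples):
    """Estimate H(P) in bits/sample, with P taken as the relative
    frequency of each symbol over the measurement span."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


h_two_symbols = entropy_bits_per_sample([0, 0, 1, 1])
h_four_symbols = entropy_bits_per_sample([0, 1, 2, 3])
```

Two equally likely symbols give 1 bit/sample and four equally likely symbols give 2 bits/sample, as expected from (3).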

If Z is an infinite sequence of samples from a
memoryless source with fixed and known symbol
probability distribution then H(P) represents the
minimum possible expected bits/sample required
to represent Z using any coding technique. But as
we have just noted, most practical problems which
can be transformed by preprocessing into
equivalent memoryless problems are characterized by
changing or possibly unknown distributions. In
practice, it is generally difficult if not impossible
to meaningfully model the way in which P changes,
although the fact that it changes may be quite
obvious. Consequently, the equivalent "bounds" for
real data sources with changing P are difficult to
come by.

Fig. 1. Reversible Preprocessing

Except where explicitly noted, the stated per- 
formance of a particular code operator will be 
based on measured performance using real data. 
H(P) provides the desired practical measure of 
performance where P is the average symbol prob- 
ability distribution over the measurement span. 

If the real data has a somewhat uniform statistical
character over the measurement span then H(P)
represents a practical bound to average per sample
performance of any code operator. An algorithm
is performing efficiently if its measured
average performance is close to H(P). However,
if data character changes significantly over the
measurement span, it may be possible to obtain
average per sample performance under the measured
H(P) by adapting the coding to suit the
changes. H(P) is still a useful guide in those
cases and does in fact bound the best performance
available with a single code (e.g. a Huffman code
designed for P).
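
The remark that adaptive coding can perform under the measured H(P) can be illustrated numerically. In the sketch below (the data and names are invented for illustration), a measurement span consists of two halves with very different statistics. The entropy of the average distribution over the whole span is compared with the average of the entropies measured separately on each half. The comparison ignores any overhead needed to signal the adaptation.

```python
import math
from collections import Counter


def entropy_bits_per_sample(samples):
    """Entropy of the relative-frequency distribution of the samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical data: a quiet half (all zeros) and an active half
# (four equally likely symbols), each 64 samples long.
quiet_half = [0] * 64
active_half = [0, 1, 2, 3] * 16

# H(P) with P averaged over the whole measurement span.
span_entropy = entropy_bits_per_sample(quiet_half + active_half)

# Average of the entropies measured on each half separately; an ideal
# coder that adapts per half could approach this lower figure.
adaptive_entropy = 0.5 * (entropy_bits_per_sample(quiet_half)
                          + entropy_bits_per_sample(active_half))
```

Here the per-half average is 1 bit/sample, while the entropy of the averaged distribution is roughly 1.55 bits/sample.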


Code operators. A code operator is a reversible
mapping of a preprocessed data sequence
(Fig. 1) into a binary output sequence. Code
operator structures developed in Refs. 1 and 2
generally have many possible internal parameters.
The notational convention adopted for identifying
operators (structures) is to subscript and
superscript the symbol psi ($\psi$) and occasional other
symbols and implicitly assume that a detailed
specification could be obtained by reference to a
parameter string (e.g. the calling parameters of
the COSMIC software). For example the operation
of code operator $\psi_j[\cdot]$ on data sequence X
produces the coded binary sequence

    $Y = \psi_j[X]$    (4)

requiring

    $\mathscr{L}(\psi_j[X])$ bits    (5)

Y can be decoded by applying the inverse operation
$\psi_j^{-1}[\cdot]$ so that

    $\psi_j^{-1}[Y] = X$    (6)

Elements of a parameter string generally needed
in the definition of $\psi_j[\cdot]$ would be the input
alphabet size, q, and the length of X. Many other
parameters may be required depending on the
specific $\psi_j[\cdot]$.

Estimator and bound. An extremely practical
characteristic of these noiseless code operators
is that estimates (actually bounds) of actual
performance can be obtained basically as simple
functions of the sum of input samples. This can
simplify performance assessments as well as aid
in the creation of new algorithms. The notation
convention adopted for identifying these estimators
is to subscript and superscript gamma ($\gamma$) in the
same manner as the corresponding code operator.
Continuing the example above we have the estimate
and bound

    $\gamma_j(X) \approx \mathscr{L}(\psi_j[X])$    (7)

and

    $\gamma_j(X) \ge \mathscr{L}(\psi_j[X])$    (8)

Performance Summary 

The potential user of these code operators
would first seek to model his data source to develop
an appropriate "reversible preprocessor" which
meets the objectives outlined in Fig. 1. The
remaining problem is then to efficiently code (i.e.
close to the entropy, H(P)) a discrete memoryless
source with varying or unknown entropy values.

The algorithms in Refs. 1-3 offer a practical
solution to this problem by providing options
which exhibit efficient performance for any range
of data entropies. A special subset of these
algorithms is a class of binary memoryless code
operators capable of performing close to the binary
entropy function as the (a priori) unknown
probability of a zero or one varies between 0.0 and 1.0.

In each application a user may avoid unnecessary
complexity by selecting operators which assure
efficient performance only over the entropy
range of his specific problem.

CODE OPERATORS 

The discussions within this section assume that 
input data is the result of reversible preprocess- 
ing operations as described in Fig. 1. 

Basic Compressor 

The "Basic Compressor" is defined as code
operator $\psi_4[\cdot]$. $\psi_4[\cdot]$ selects between four code
operators, $\psi_0[\cdot]$, $\psi_1[\cdot]$, $\psi_2[\cdot]$ and $\psi_3[\cdot]$,
choosing the operator which will most efficiently
represent the input block of samples X. The result of
applying $\psi_4[\cdot]$ to X is given by the binary sequence

    $\psi_4[X] = ID * \psi_{ID}[X]$    (9)

where * denotes concatenation, $\psi_{ID}[X]$ is the
binary output sequence resulting from the application
of the selected code operator $\psi_{ID}[\cdot]$
to X, and ID is a two bit binary sequence (00, 01,
10 or 11) indicating which code operator was
chosen. $\psi_4[\cdot]$ performs efficiently over the
entropy range of approximately 0.7 to 4.0 bits/sample
because at least one of the four "options"
performs well at any given point of this range.
$\psi_4[\cdot]$ is thus adaptive. The parameters needed
for $\psi_4[\cdot]$ are the input block length J and the
number of values that a sample of X may take on, q.

A smaller block size increases the rate at which
the code options may be changed but also incurs a
higher per sample overhead of 2/J bits/sample.
For most applications net performance is rather
insensitive to changes in J for all but very small
or very large J. A convenient practical choice is
J=16.
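
The selection structure of equation (9), a two bit identifier followed by the output of whichever option codes the block in the fewest bits, can be sketched as below. The four candidate coders used here are stand-ins chosen for illustration (three simple unary-plus-low-bits codes and a fixed-length fallback); they are assumptions and are not the definitions of the four options given in Refs. 1 and 2. All function names are likewise hypothetical.

```python
def unary_split_encode(block, k):
    """Stand-in option: unary-code the high part of each sample,
    then append its k low-order bits unchanged."""
    bits = []
    for s in block:
        bits.extend([0] * (s >> k))
        bits.append(1)  # terminator of the unary part
        bits.extend((s >> i) & 1 for i in reversed(range(k)))
    return bits


def fixed_length_encode(block, nbits):
    """Stand-in option: no compression, nbits per sample."""
    return [(s >> i) & 1 for s in block for i in reversed(range(nbits))]


def encode_block(block, q):
    """Code one block as ID * (output of the shortest option)."""
    nbits = max(1, (q - 1).bit_length())
    candidates = [unary_split_encode(block, 0),
                  unary_split_encode(block, 1),
                  unary_split_encode(block, 2),
                  fixed_length_encode(block, nbits)]
    best = min(range(4), key=lambda i: len(candidates[i]))
    return [best >> 1, best & 1] + candidates[best]


def decode_block(bits, J, q):
    """Inverse of encode_block for a block of J samples."""
    nbits = max(1, (q - 1).bit_length())
    option = (bits[0] << 1) | bits[1]
    pos = 2
    block = []
    for _ in range(J):
        if option == 3:
            value = 0
            for _ in range(nbits):
                value = (value << 1) | bits[pos]
                pos += 1
        else:
            high = 0
            while bits[pos] == 0:
                high += 1
                pos += 1
            pos += 1  # skip the unary terminator
            low = 0
            for _ in range(option):
                low = (low << 1) | bits[pos]
                pos += 1
            value = (high << option) | low
        block.append(value)
    return block


low_activity = [0, 1, 0, 2, 0, 0, 1, 3, 0, 0, 1, 0, 0, 0, 2, 1]
coded = encode_block(low_activity, q=16)
decoded = decode_block(coded, J=16, q=16)
```

With J = 16 the identifier costs 2/16 = 0.125 bits/sample, which is the overhead term noted in the text.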

Typical average measured performance for the
Basic Compressor is shown in Fig. 2 compared
to a plot of measured average entropy, H(P).
Note that when distributions are stable performance
lies slightly above H(P), whereas when
distributions vary considerably performance under H(P)
may be observed.

Fig. 2. Typical Average Performance of Basic
Compressor $\psi_4[\cdot]$

Related operators. Given a sequence Y of N
preprocessed samples, operator $\psi_6[\cdot]$ will first
partition Y into blocks of J samples each
(lengthening the last block if J does not divide N) and then
code each block with $\psi_4[\cdot]$. $\psi_5[\cdot]$ is simply a
special case of $\psi_6[\cdot]$ where J divides N.

Higher Entropies 

Operator $\psi_7[\cdot]$ adds to $\psi_6[\cdot]$ additional
"split-sample" modes which extend efficient performance
above 4 bits/sample entropies. $\psi_6[\cdot]$ by itself is
considered split-sample "0". A coded Y sequence
is preceded by a mode identifier in much the same
manner as for $\psi_4[\cdot]$ in (9). The first bit of mode
identification provides one additional mode and
extends efficient performance upwards by one
bit/sample to 5 bits/sample. Adding a second
identification bit allows two more split-sample modes
and extends the range of efficient performance by
another two bits/sample, and so on. There is no
fundamental limit to this operation. It is up to the
user to select the number of modes which meet his
objectives.

A special simplification of $\psi_7[\cdot]$ can be defined
when N=J. This operator, $\psi_8[\cdot]$, has been
implemented in an application to NOAA weather
satellite images (Ref. 5) and is part of the image
compression system on the Galileo Project.

Lower Entropies

$\psi_9[\cdot]$ provides the ability to achieve efficient
performance at very low entropies without
sacrificing performance at the higher entropies. The
structure of $\psi_9[\cdot]$ is illustrated in Fig. 3 for a
preprocessed data sequence Z.

As shown, the SPLIT[$\cdot$] function splits Z into
a sequence of the non-zero samples $\theta$ and a binary
sequence D which identifies which samples of Z
are non-zero. Separate coding of D and $\theta$ using
$\psi_\beta[\cdot]$ and $\psi_\alpha[\cdot]$ yields the desired result. $\theta$ has
characteristics satisfying the preprocessing
requirements of Fig. 1 so that $\psi_\alpha[\cdot]$ can be
identified as a form of $\psi_8[\cdot]$. D appears as a binary
memoryless source with unknown and changing
statistic

    $p_0 = \Pr[\text{sample of } D \text{ equals zero}]$    (10)

$\psi_\beta[\cdot]$ in Fig. 3 is a class of code operators to
efficiently code such sources. Appropriate choice
of $\psi_\beta[\cdot]$ yields $\psi_9[\cdot]$ performance characteristics
shown in Fig. 4 for data with slowly changing
characteristics and input alphabet size of 256.

The operator $\psi_{10}[\cdot]$ chooses between $\psi_8[\cdot]$
and $\psi_9[\cdot]$. Roughly the same performance is
observed but certain subtle implementation
advantages emerge. The reader is referred to Refs. 1
and 2.

Fig. 3. General Form, Operator $\psi_9[\cdot]$
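
The SPLIT function described above, which separates a sequence Z into the sequence of its non-zero samples and a binary sequence D marking where the non-zero samples occur, can be sketched as follows. The convention that D holds 1 for a non-zero sample and 0 for a zero sample is an assumption, as are the function names; the two output sequences would then be coded separately.

```python
def split(z):
    """Split Z into D (1 where the sample is non-zero, else 0) and
    the sequence of non-zero samples, in order of appearance."""
    d = [0 if s == 0 else 1 for s in z]
    nonzero = [s for s in z if s != 0]
    return d, nonzero


def merge(d, nonzero):
    """Inverse of split: rebuild Z from D and the non-zero samples."""
    values = iter(nonzero)
    return [next(values) if flag else 0 for flag in d]


z = [0, 0, 3, 0, 1, 0, 0, 0, 2, 0]
d, nonzero = split(z)
p0_estimate = d.count(0) / len(d)  # fraction of D samples equal to zero
```

For very low entropy data most samples of Z are zero, so D is dominated by zeros and the sequence of non-zero samples is short.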


Binary Coders 

The coders $\psi_\beta[\cdot]$ noted above are generally
useful with or without the structure of $\psi_9[\cdot]$
preceding them. Basically $\psi_\beta[\cdot]$ first preprocesses
the binary data into a form which allows the
non-binary operators $\psi_4[\cdot]$ - $\psi_{10}[\cdot]$ to be used. A
binary operator of this form which used $\psi_9[\cdot]$
internally following this additional preprocessing
would be labeled $\psi_\beta^{9}[\cdot]$ whereas one that used
$\psi_4[\cdot]$ internally would be labeled $\psi_\beta^{4}[\cdot]$. Since
$\psi_\beta^{9}[\cdot]$ or $\psi_\beta^{10}[\cdot]$ use a $\psi_9[\cdot]$ operator internally
they must also employ a binary operator
internally (see the structure of $\psi_9[\cdot]$ in Fig. 3). This
internal binary operator could also be $\psi_\beta^{9}[\cdot]$ and
so on. Thus the $\psi_\beta^{9}[\cdot]$ or $\psi_\beta^{10}[\cdot]$ structure could
allow for an infinite tree of code operators. From
a performance point of view, each additional level
of $\psi_\beta^{9}[\cdot]$ or $\psi_\beta^{10}[\cdot]$ improves performance in the
vicinity of zero entropy. A user must assess
whether his particular data problem merits the
added complexity. The COSMIC routines allow for
$\psi_\beta^{10}[\cdot]$ up to three levels deep. A bound to the
performance of such a code operator for an ideal
binary memoryless source with unknown $p_0$ is
shown in Fig. 5 in comparison to the binary
entropy function

    $H_\beta(p_0) = -p_0 \log_2 p_0 - (1 - p_0) \log_2 (1 - p_0)$    (11)

Fig. 4. Measured Average Performance, $\psi_9[\cdot]$
(horizontal axis: average entropy, H(P), bits/sample)

A simplification of $\psi_\beta[\cdot]$ can be obtained when
it is a priori known that $p_0 \ge 1/2$ or $p_0 \le 1/2$.

Such a modified operator is identified by if / q t C ‘ 3 ■ 

P 

SOFTWARE RELATIONSHIPS 

The COSMIC software package (Ref. 3) implementing
the noiseless coding algorithms described above is
written in ANSI Fortran IV. It has been run in
present form on an SEL 32/55 and an IBM 370/158
and we believe it could be used on most other
systems without modification. All computations are
done using integer arithmetic. Binary input and
output strings are accomplished using
1-bit-per-integer-word (16 bits) representation. While
demanding more memory than 1-bit-per-bit
representation the processing is generally much
faster using Fortran.

Fig. 5. Bounds to Expected Performance of
Multi-Level $\psi_\beta^{10}[\cdot]$ on Ideal
Memoryless Source, Unknown
but Constant $p_0$

These routines were written with the research 
and/or engineering user in mind. As such they 
have not been optimized for execution efficiency 
or memory conservation. A conscientious appli- 
cation of optimizing techniques would doubtless 
effect significant improvements. 

General Implementation Instructions 

The software package consists of two files of 
Fortran-IV source code. The first file contains 
all the subroutines for coding and decoding, and 
comprises about 2800 Fortran source statements. 

It is most convenient to compile the routines and 
place them in a disk subroutine library for subse- 
quent use by calling programs. 

A second file provides a test program. It is
recommended that this program be compiled and
executed to test the subroutines in file 1 to
ensure they are operating correctly.

Neither file contains job control statements 
since these will vary from machine to machine. 

The required operations of compiling, cataloging 
and executing should be familiar enough to any 
programmer to make the job control problem an 
easy one. 

Naming Conventions 

The subroutines provided are named by a standard
naming convention. Basic support subroutines
have functionally mnemonic names. The
names for coders, decoders and estimators were
chosen to closely match the notation established in
Refs. 1 and 2 as well as earlier sections. For
example, subscripting and superscripting psi ($\psi$)
to identify code operators is replaced by subroutine
names beginning with 'SI' and ending with 'I' if the
routine is a decoder (inverse). Letters or numbers
between 'SI' and 'I' modify the name further.
For example operator $\psi_{10}[\cdot]$ with two levels of
internal tree structure becomes SI102 in software.
Similarly the decoder for binary operator $\psi_\beta^{10}[\cdot]$
with two levels of internal tree structure becomes
SB102I (where the 'I' in SI was dropped to enable
a six-letter word).

The estimators for operator performance,
identified by subscripting and superscripting
gamma ($\gamma$) in Refs. 1 and 2, become GAxxx in the
COSMIC software.

In summary, the naming convention for subroutines
should enable a smooth transition from
technical definitions to practical software
application. Further details are provided in Ref. 3.

Calling Arguments 

In order to facilitate understanding of the various
coding, decoding and estimating routines, calling
argument names and variable names within the
Fortran code have been used consistently as much
as possible. Each subroutine contains documentation
describing the calling arguments specific to
itself. More general documentation concerning the
rules of use of the arguments, their inter-relationships
and the overall structure of the subroutine
set is provided as part of the software distribution
package available from COSMIC.

ACKNOWLEDGMENT 

The research described in this paper was carried
out by the Information Processing Research
Group of the Jet Propulsion Laboratory, California
Institute of Technology, and was sponsored by the
U.S. Postal Service through an agreement with the
National Aeronautics and Space Administration.

REFERENCES

1. R. F. Rice, "Some Practical Universal Noiseless
Coding Techniques," JPL Publication 79-22,
Jet Propulsion Laboratory, Pasadena, CA,
March 15, 1979.

2. R. F. Rice, "Practical Universal Noiseless
Coding," SPIE Symposium Proceedings, Vol.
207, San Diego, CA, August 1979.

3. A. Schlutsmeyer, R. Rice, "Universal Noiseless
Coding Routines," Case NPO 15451, Computer
Software Management and Information
Center (COSMIC), Athens, GA, November 1980.

4. R. F. Rice, "An Advanced Imaging Communication
System for Planetary Exploration," Vol. 66,
SPIE Seminar Proceedings, Aug. 21-22, 1975,
pp. 70-89.

5. R. F. Rice, A. P. Schlutsmeyer, "Data Compression
for NOAA Weather Satellite Systems,"
Vol. 249, Proceedings 1980 SPIE Symposium,
San Diego, CA, July 1980.

6. R. F. Rice et al., "Block Adaptive Rate Controlled
Image Data Compression," Proceedings
of 1979 National Telecommunications Conference,
Washington, D.C., Nov. 1979.



APPENDIX B 

DIRECTED MARKOV PREDICTION 

The following discussion describes the concepts behind the Bell Labs Markov
Predictor and an extension called "Directed Markov Prediction."


STANDARD MULTI-STATE PREDICTOR 

The standard approach to a 16 state predictor, much like the Bell Labs predictor, is
diagrammed below in Fig. B-1.

Here y is the binary sample to predict and the $\{x_i\}$ are known data samples to be used in
the prediction. The $\{x_i\}$ take on 16 possible states, $S_j = x_1 x_2 x_3 x_4$ (e.g., $S_0 = 0000$,
$S_1 = 0001$, etc.).

For a given state, $S_j$, the prediction $y_p$ is chosen such that

    $\Pr[y = y_p \mid S_j] = \max\{\Pr[y = 0 \mid S_j],\ \Pr[y = 1 \mid S_j]\} \ge 1/2$    (B-1)

    $= \Pr[\text{correct} \mid \text{state } S_j]$.

The overall probability of being correct is then

    $P_c = \sum_j \Pr[S_j]\,\Pr[\text{correct} \mid S_j]$.    (B-2)
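
A minimal sketch of the standard 16 state predictor of (B-1) and (B-2) is given below, with all probabilities estimated from counts gathered over training data. The function names and the toy training counts are assumptions made for illustration; only the state labeling ($S_1 = 0001$, so that $x_4$ is the low-order bit) follows the text.

```python
from collections import defaultdict


def state_index(x1, x2, x3, x4):
    """Pack the four known binary samples into a state number 0..15."""
    return (x1 << 3) | (x2 << 2) | (x3 << 1) | x4


def train_predictor(observations):
    """observations: iterable of (state, y) pairs with y in {0, 1}.

    Returns the per-state prediction of (B-1) and the overall
    probability of a correct prediction of (B-2)."""
    counts = defaultdict(lambda: [0, 0])
    for state, y in observations:
        counts[state][y] += 1
    # (B-1): predict the more frequent value of y in each state.
    prediction = {s: (1 if c[1] > c[0] else 0) for s, c in counts.items()}
    # (B-2): sum over states of Pr[S] * Pr[correct | S]; with counts
    # this reduces to (total correct predictions) / (total samples).
    total = sum(c[0] + c[1] for c in counts.values())
    correct = sum(max(c) for c in counts.values())
    return prediction, correct / total


# Hypothetical training data for two states.
s_a = state_index(0, 1, 1, 0)
s_b = state_index(0, 0, 1, 1)
data = [(s_a, 0)] * 205 + [(s_a, 1)] * 22 + [(s_b, 1)] * 10 + [(s_b, 0)] * 5
prediction, p_correct = train_predictor(data)
```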



Fig. B-1. Prediction Samples.

One should expect to improve performance by increasing the state set $S_j$. Others
have done so to a limited degree but have been limited by the exponential growth of
states as more samples are added. Ideally the rather large crosshatched area in Fig. B-1
(perhaps encompassing whole letters) should be used in the prediction. The following
approach seeks to obtain most of these benefits without the corresponding complexity
problem.


EXPANDING THE USE OF SURROUNDING DATA 

Letting $S_j$ remain as any one of the original 16 states and f as any one of the known
data samples of the crosshatched region of Fig. B-1, we can rewrite (B-1) as follows:

    $\Pr[y = y_p \mid S_j] = \Pr[y = y_p \text{ AND } \{f = 1 \text{ OR } f = 0\} \mid S_j]$.    (B-3)

This can be written as

    $\Pr[f = 1 \mid S_j]\,\Pr[y = y_p \mid S_j, f = 1] + \Pr[f = 0 \mid S_j]\,\Pr[y = y_p \mid S_j, f = 0]$.    (B-4)

Now choose $f^*$ such that

    $\Pr[y = y_p \mid S_j, f = f^*] < \Pr[y = y_p \mid S_j, f = \bar{f}^*]$.    (B-5)

(That is, knowing $f = f^*$ yields a smaller probability that y equals the original prediction
$y_p$ than does knowing that f equals the complement of $f^*$.)

Note that only the left hand side of (B-5) can be less than 1/2, otherwise (B-1) could
not be true.

Change of Decision

Now find the set $\chi(S_j)$ of all f that satisfy

    $\Pr[y = y_p \mid S_j, f = f^*] < 1/2$.    (B-6)

For each $f = f^*$ the best prediction should be changed from $y = y_p$ to its complement
$\bar{y}_p$.

This changes state $S_j$ probability of error from

    $P_e(S_j) = \Pr[f = f^* \mid S_j]\,\{1 - \Pr[y = y_p \mid S_j, f = f^*]\} + \{\text{term when } f = \bar{f}^*\}$    (B-7)

to

    $P_e(S_j, f = f^*) = \Pr[f = f^* \mid S_j]\,\Pr[y = y_p \mid S_j, f = f^*] + \{\text{term when } f = \bar{f}^*\}$.    (B-8)

The improvement is given by

    $\Delta P_e(S_j, f = f^*) = \Pr[f = f^* \mid S_j]\,\{1 - 2 \Pr[y = y_p \mid S_j, f = f^*]\}$.    (B-9)

Now order the $f \in \chi(S_j)$ by the measure in (B-9). The first one on the list will provide
the largest improvement in $P_e$.

At this point we have a predictor which will make its original prediction for each
state $S_j$ unless the selected $f \in \chi(S_j)$ for that state has a value $f^*$, in which case the
prediction is complemented. The actual selected samples from the crosshatched region
may be different for each $S_j$.


Extending the Tree 


By selecting the $f^* \in \chi(S_j)$ for each state $S_j$ we have started the first branch of a
predictor tree. To extend these branches note that the set of data samples $f \notin \chi(S_j)$ are those
data samples which would not alter our initial decision. We can in fact order them by
how much they tend to make the original decision more certain (those $f \in \chi(S_j)$ with
values $f^*$ we think can be excluded).

Now just as we found the $f = f^* \in \chi(S_j)$ which caused a decision change from $y_p$ to
$\bar{y}_p$, we find the $f^* \in \chi(S_j)$ which cause us to change that decision back, picking the one
with the biggest impact. (Here conditioning is on the new states $\{S_j, f^*\}$.)

The tree can also be extended from the state $\{S_j, \bar{f}^*\}$ for which we did not change
initial prediction, $y_p$. The candidates here are of course the unused elements of the
ordered $f \in \chi(S_j)$. Intuitively the best candidates should be at the top of the list.

The general procedure is outlined in Fig. B-2 where we have subscripted the different
$f^*$ by letters.

Fig. B-2. (Tree diagram: at each branch, pick $f^*$ from $\chi(S_j)$.)


EXAMPLES 


Expanding Fig. B-1 to include a numbering of pixels in the crosshatched region
leads to the diagram in Fig. B-3 which shows 56 additional pixels which would be used
for prediction.

Fig. B-3. Arrangement of Potential Predictor Samples.

In the following we will investigate extending state Sg to include additional
samples. This is a result taken from a 256 x 256 region. The basic state Sg information
is displayed in Fig. B-4.

    1 1 0        POPULATION = 227
    0 y          PREDICTION y_p = 0
                 ERRORS = 22

Fig. B-4. State Sg Prediction.

The next step is to search all samples surrounding Sg to see if adjoining any one of them
to Sg can improve our prediction ability.

In this example sample z48 is the best (and only) choice. The predictor for state Sg
becomes that shown in Fig. B-5.


Fig. B-5. Extended Sg Predictor. (State Sg sample pattern with sample z48 adjoined.)


For all the 227 times Sg occurs, z48 = 1 33 times and zero the remaining 194
times. But of the 33 times z48 = 1, y differs from the original prediction y_p = 0 17 times.
Since 17/33 > 0.5 we should instead predict y_p = 1 when z48 = 1. This result is
depicted below in Fig. B-6.

While this reduced the number of errors by only one, we can branch again from the
newly defined state Sg and z48 = 1. The best option in this case turns out to be z40.
The result is shown below in Figs. B-7 and B-8.

As shown, when z40 = 1 the prediction is changed back to y_p = 0. The number of
total errors is now reduced to 9.

Fig. B-6. Adding z48.

Fig. B-7. Extended Sg, z48 = 1 Predictor.
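
The error counts quoted in this example can be checked with a few lines of arithmetic. The sketch below takes the population and error counts reported in Fig. B-4 (227 and 22), the counts quoted in the text for z48 = 1 (33 occurrences, 17 of them with y = 1), and those of Fig. B-8 for z40 = 1 (population 14, 1 error when predicting 0). It also evaluates the improvement measure (B-9) from counts. The function names are assumptions made for illustration.

```python
def errors_after_flip(total_errors, errors_in_subset_before, subset_size):
    """Total errors after complementing the prediction on a subset:
    the samples that were wrong there become right, and vice versa."""
    errors_in_subset_after = subset_size - errors_in_subset_before
    return total_errors - errors_in_subset_before + errors_in_subset_after


def improvement(n_state, n_f, n_agree):
    """(B-9) estimated from counts: n_f occurrences of f = f* among
    n_state occurrences of the state, of which n_agree have y = y_p."""
    return (n_f / n_state) * (1 - 2 * n_agree / n_f)


# First branch: z48 = 1 occurs 33 times; y = 1 (an error for y_p = 0) 17 times.
after_first = errors_after_flip(22, 17, 33)

# Second branch, inside z48 = 1: z40 = 1 occurs 14 times and predicting 0
# there leaves 1 error, so predicting 1 there had produced 13 errors.
after_second = errors_after_flip(after_first, 13, 14)

# (B-9) for the first branch: y agrees with y_p = 0 in 33 - 17 = 16 cases.
delta_first = improvement(227, 33, 16)
```

The totals come out to 21 errors after the first branch and 9 after the second, in agreement with the text, and the first-branch improvement from (B-9) equals one error in 227 samples.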


A predictor using this approach would check z48 when state Sg was present. If
z48 = 0 the predictor would predict zero and go on to the next sample. If z48 = 1 it
would next check z40. If z40 = 1 it would predict zero, whereas if z40 = 0 it would
predict 1.
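
The decision procedure just described can be represented as a small tree that is walked for each sample to be predicted. The dictionary layout and names below are assumptions chosen for illustration: each node holds a prediction and, optionally, a further sample to test together with the value of that sample which triggers the move to the next node.

```python
# Tree for the example state: predict 0 unless z48 = 1; in that case
# predict 1 unless z40 = 1, which changes the prediction back to 0.
example_tree = {
    "predict": 0,
    "test": "z48",
    "value": 1,
    "next": {
        "predict": 1,
        "test": "z40",
        "value": 1,
        "next": {"predict": 0},
    },
}


def predict(tree, samples):
    """Walk the tree using the known surrounding samples (a dict
    mapping sample names to 0 or 1) and return the prediction for y."""
    node = tree
    while "test" in node and samples[node["test"]] == node["value"]:
        node = node["next"]
    return node["predict"]


results = {(z48, z40): predict(example_tree, {"z48": z48, "z40": z40})
           for z48 in (0, 1) for z40 in (0, 1)}
```

Only the samples along the path actually taken need to be examined, so extending the tree adds little work for the most common case (z48 = 0 here).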

Our preliminary average results are not as overwhelming as this example. For a
double branch as above, results suggested that a basic 16 state (Bell Labs) predictor
would require a 10 to 20 percent greater rate.

Fig. B-8. Adding z40. (POPULATION = 14, ERRORS = 1)